Ganglia Web - Hosts Up and Hosts Down Issue - ganglia

I have set up Ganglia(Ganglia Core 3.6.0 and Ganglia Web 3.5.10) to monitor my cluster.
When gmond is restarted in a machine, metrics from all other gmond machines also gets stopped ie I am not able to see metrics getting published from other machines in Ganglia Web. And I can also see Hosts up going to 0 and Hosts down as 13(total number of machines). As time goes, the Hosts up comes back to 13.
Am I missing something ?? Can some one help me...

If it's always the same machine, it should be a gmond 'end-point'. The gmetad daemon is querying only one gmond (no redundancy), if he goes down everybody seems to be going down.
If there are a redundancy (eg. more than one host in a data source), you can expect some lag if the first one goes down because of the number of TCP queries before it timesout.

Related

Cassandra Kubernetes Statefulset NoHostAvailableException

I have an application deployed in kubernetes, it consists of cassandra, a go client, and a java client (and other things, but they are not relevant for this discussion).
We have used helm to do our deployment.
We are using a stateful set and a headless service for cassandra.
We have configured the clients to use the headless service dns as a contact point for cluster creation.
Everything works great.
Until all of the nodes go down, or some other nefarious combination of nodes going down, I am simulating it by deleting all pods using kubectl delete in succession on all of the cassandra nodes.
When I do this the clients throw NoHostAvailableException
in java its
"java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.200.23.151:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_QUORUM (1 required but only 0 alive)), /10.200.152.130:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency ONE (1 required but only 0 alive)))"
which eventually becomes
"java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)"
in go its
"gocql: no hosts available in the pool"
I can query cassandra using cqlsh, the node seems fine using nodetool status, all of the new ips are there
the image I am using doesnt have netstat so I have not yet confirmed its listening on the expected port.
Via executing bash on the two client pods I can see the dns makes sense using nslookup, but...
netstat does not show any established connections to cassandra (they are present before I take the nodes down)
If I restart my clients everything works fine.
I have googled a lot (I mean a lot), most of what I have found is related to never having a working connection, the most relevant things seem very old (like 2014, 2016).
So a node going down is very basic and I would expect everything to work, the cassandra cluster manages itself, it discovers new nodes as they come online, it balances the load, etc. etc.
If I take my all of my cassandra nodes down slowly, one at a time, everything works fine (I have not confirmed that the load is distributed appropriately and to the correct node, but at least it works)
So, is there a point where this behaviour is expected? ie I have taken everything down, nothing was up and running before the last from the first cluster was taken down.. is this behaviour expected?
To me it seems like it should be an easy issue to resolve, not sure whats missing / incorrect, I am surprised that both clients show the same symptoms, makes me think something is not happening with our statefulset and service
I think the problem might lie in the headless DNS service. If all of the nodes go down completely and there are no nodes at all available via the service until pods are replaced, it could cause the driver to hang.
I've noted that you've used Helm for your deployments but you may be interested in this document for connecting to Cassandra clusters in Kubernetes from the authors of the cass-operator.
I'm going to contact some of the authors and get them to respond here. Cheers!

Two versions of fluentd fighting over port in my cluster

Somehow, I have 2 versions of fluentd running in my cluster:
They end up fighting over the same port, they just keep cranking away, trying to start up on that port, and it saturates all the CPU in the cluster.
unexpected error error_class=Errno::EADDRINUSE error="Address already in use - bind(2) for 0.0.0.0:24231
/opt/google-fluentd/embedded/lib/ruby/2.6.0/socket.rb:201:in 'bind'
I've tried deleting the daemon sets and deployments, they just keep coming back. Also tried ssh'ing into the machines and killing the process on that port. Nothing seems to work.
Obviously, I only want one version of fluentd to run (and I'm not even sure which one).
I seem to have fixed it. I went to GCP dashboard cluster edit page, Kubernetes Engine Monitoring dropdown was blank. It seems not even the dropdown could decide what to display here.
It seems the automated agent, or whatever, seriously messed up here, and had 2 versions of the logging and monitoring system running, fighting over a port, and crushing the CPU on every machine in the cluster. On top of that, I couldn't delete the daemon sets, pods, or deployments. It seems Google treats these as special somehow, maybe with some kind of automated agent, I don't know.
From the dropdown, I just selected System and workload logging and monitoring, saved, and it applied the changes.
Everything looking good so far, but this whole event has me worried, I didn't do anything. This just....happened.
This is a dev cluster, but if it was a production cluster...

JMeter throughput drops when hitting Amazon ELB

I am hosting a web application on Amazon's AWS Servers. I am currently in the process of load testing the application with JMeter. My main problem seems to be that when I go through an Elastic Load Balancer (ELB) to hit the Amazon server's rather than hitting the servers directly - I seem to hit a cap in my throughput.
If I hit my web application directly - for each server I am able to achieve a throughput of 50 RPS per server.
If I hit my web application via Amazon's ELB - I am only able to achieve a max throughput of 50 RPS (total)
I was wondering if anyone else has experienced similar behavior when load testing using Jmeter via Amazon's ELB.
For more context my web application is a REST application which allows users to download content (~150 kb) via HTTP requests.
I am running Jmeter with the following flag "-Dsun.net.inetaddr.ttl=0" and running it with 10 threads. I have tried running these tests with multiple clients on different machines.
Thanks for any help in advance.
Load balancers may be tricky to test as they may have different mechanisms of orchestrating traffic depending on origin. The most commonly used approach to distinguish origin of the request and redirect it to the same host, which served previous request is a cookie. You can look into HTTP Cookie Manager to correctly manipulate your cookies and make sure than you have different ones for each testing thread or thread group (depending on your use case). Another flaky area is origin host IP. You may require to bind each testing thread to different IP address in order to hit different servers behind the load balancer. There can be also some issues with DNS in regards to Amazon LBs. useful guide on how to test Amazon ELBs
Most probable cause would be DNS caching by jmeter. ELB returns IPs of additional servers depending on how autoscaling is set but JMeter does not use these additional servers. This problem can be solved by ensuring that Jmeter does not cache DNS results...
The ELB is a name, not IP, and can suffer from DNS caching. Make sure you use "-Dsun.net.inetaddr.ttl=0" when starting JMeter
http://wiki.apache.org/jmeter/JMeterAndAmazon
A really late response, and slightly different than the original question, but I hope this can help others as it took me a while to get it all straight. My original problem was not reduced throughput as a result of the ELB, but the introduction of HTTP 503 errors. Actually, the ELB increased my throughput as compared to querying the web application directly, though even with 1 hour tests, the results were sporadic to say the least.
First, the ELB has 2-staged load balancing going on. The first load balance is across the ELB's themselves. That's done by associating multiple IP addresses to the hostname provided by AWS for the ELB you provision. The second is then, of course, across your application instances behind the ELB.
Without trying to offend the SO gods, this is a really helpful article.
https://blazemeter.com/blog/dns-cache-manager-right-way-test-load-balanced-apps
The most helpful information in there was to use the DNS Cache Manager module in JMeter. This will query multiple DNS servers, and wipe out your DNS cache.
I implemented that module and then setup Wireshark, filtering on the two IP addresses belonging to the ELB hostname and sure enough, it was querying both IP addresses, though clearly favored one over the other.
That didn't make a big difference, at least not over short tests.
The real difference (2-3 times more throughput) came when I tweaked the ELB health settings. I initially had a high error rate, however after reducing the unhealthy threshold and the interval between health checks, my error rates dropped dramatically.
Additionally, whereas all my other tests had been 60 - 90 minutes in duration, this one was 8 hours. I started out with decent throughput and it then quickly dropped (by about 2/3). After about 20 minutes or more, the throughput then started ticking back up and by the end of the test, it had sustained throughput of about 5 times what I was getting without the ELB (which was similar to what the throughput was when it dropped shortly after beginning this test).

jboss clustering GMS, join

I have jboss 5.1.0.
we have configured jboss somehow using clustering, but in fact we do not use clustering while developing or testing. But in order to launch the project i have to type the following:
./run.sh -c all -g uniqueclustername
-b 0.0.0.0 -Djboss.messaging.ServerPeerID=1 -Djboss.service.binding.set=ports-01
but while jboss starting i able to see something like this in the console:
17:24:45,149 WARN [GMS]
join(172.24.224.7:60519) sent to
172.24.224.2:61247 timed out (after 3000 ms), retrying 17:24:48,170 WARN
[GMS] join(172.24.224.7:60519) sent to
172.24.224.2:61247 timed out (after 3000 ms), retrying 17:24:51,172 WARN
[GMS] join(172.24.224.7:60519)
here 172.24.224.7 it is my local IP
though 172.24.224.2 other IP of other developer in our room (and jboss there is stoped).
So, it tries to join to the other node or something. (i'm not very familiar how jboss acts in clusters). And as a result the application are not starting.
What may be the problem in? how to avoid this joining ?
You can probably fix this by specifying
-Djgroups.udp.ip_ttl=0
in your startup. This Sets the IP time-to-live on the JGroups packets to zero, so they never get anywhere, and the cluster will never form. We use this in dev here to stop the various developer machines from forming a cluster. There's no need to specify a unique cluster name.
I'm assuming you need to do clustering in production, is that right? Could you just use the default configuration instead of all? This would remove the clustering stuff altogether.
while setting up the server, keeping the host name = localhost and --host=localhost instead of ip address will solve the problem. That makes the server to start in non clustered mode.

MSMQ redundancy

I'm looking into WCF/MSMQ.
Does anyone know how one handles redudancy with MSMQ? It is my understanding that the queue sits on the server, but what if the server goes down and is not recoverable, how does one prevent the messages from being lost?
Any good articles on this topic?
There is a good article on using MSMQ in the enterprise here.
Tip 8 is the one you should read.
"Using Microsoft's Windows Clustering tool, queues will failover from one machine to another if one of the queue server machines stops functioning normally. The failover process moves the queue and its contents from the failed machine to the backup machine. Microsoft's clustering works, but in my experience, it is difficult to configure correctly and malfunctions often. In addition, to run Microsoft's Cluster Server you must also run Windows Server Enterprise Edition—a costly operating system to license. Together, these problems warrant searching for a replacement.
One alternative to using Microsoft's Cluster Server is to use a third-party IP load-balancing solution, of which several are commercially available. These devices attach to your network like a standard network switch, and once configured, load balance IP sessions among the configured devices. To load-balance MSMQ, you simply need to setup a virtual IP address on the load-balancing device and configure it to load balance port 1801. To connect to an MSMQ queue, sending applications specify the virtual IP address hosted by the load-balancing device, which then distributes the load efficiently across the configured machines hosting the receiving applications. Not only does this increase the capacity of the messages you can process (by letting you just add more machines to the server farm) but it also protects you from downtime events caused by failed servers.
To use a hardware load balancer, you need to create identical queues on each of the servers configured to be used in load balancing, letting the load balancer connect the sending application to any one of the machines in the group. To add an additional layer of robustness, you can also configure all of the receiving applications to monitor the queues of all the other machines in the group, which helps prevent problems when one or more machines is unavailable. The cost for such queue-monitoring on remote machines is high (it's almost always more efficient to read messages from a local queue) but the additional level of availability may be worth the cost."
Not to be snide, but you kind of answered your own question. If the server is unrecoverable, then you can't recover the messages.
That being said, you might want to back up the message folder regularly. This TechNet article will tell you how to do it:
http://technet.microsoft.com/en-us/library/cc773213.aspx
Also, it will not back up express messages, so that is something you have to be aware of.
If you prefer, you might want to store the actual messages for processing in a database upon receipt, and have the service be the consumer in a producer/consumer pattern.