G-WAN - Why does G-WAN detect only 1 core? - multicore

I am currently testing the G-WAN web server, and I have a question about the default settings for the G-WAN workers and the CPU Core detection.
Running G-WAN on a Xen server (which has a 4-core Xeon CPU), the gwan.log file reports that only a single core was detected:
'./gwan -w 4' used to override 1 x 1-Core CPU(s)
[Wed May 22 06:54:29 2013 GMT] using 1 workers 0[01]0
[Wed May 22 06:54:29 2013 GMT] among 2 threads 0[11]1
[Wed May 22 06:54:29 2013 GMT] (can't use more threads than CPU Cores)
[Wed May 22 06:54:29 2013 GMT] CPU: 1x Intel(R) Xeon(R) CPU E5506 @ 2.13GHz
[Wed May 22 06:54:29 2013 GMT] 0 id: 0 0
[Wed May 22 06:54:29 2013 GMT] 1 id: 1 1
[Wed May 22 06:54:29 2013 GMT] 2 id: 2 2
[Wed May 22 06:54:29 2013 GMT] 3 id: 3 3
[Wed May 22 06:54:29 2013 GMT] CPU(s):1, Core(s)/CPU:1, Thread(s)/Core:2
Any idea?
Thanks!!

the gwan.log file reports that only a single core was detected:
Xen, like many other hypervisors, breaks the CPUID instruction and the /proc/cpuinfo Linux kernel structures (both of which are used by G-WAN to detect CPU Cores).
As you have seen, this is a real problem for multithreaded applications designed to scale on multicore.
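You can verify what topology the hypervisor actually exposes to the guest kernel; a quick check (assuming the lscpu utility is available):
# Logical CPUs visible to the guest
grep -c ^processor /proc/cpuinfo
# Socket/core/thread topology the kernel derived from CPUID
lscpu | grep -E 'Socket|Core|Thread'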
'./gwan -w 4' used to override 1 x 1-Core CPU(s)
To avoid stupid manual overrides wasting memory and impacting performance, G-WAN checks that the user-supplied thread count is not greater than the actual CPU Core count.
This is why you have the warning: "(can't use more threads than CPU Cores)".
To bypass this protection and warning, you can use the -g ("God") mode:
./gwan -g -w 4
This command line switch is documented in the G-WAN executable help:
./gwan -h
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Usage: gwan [-b -d -g -k -r -t -v -w] [argument]
(grouped options '-bd' are ignored, use '-b -d')
-b: use the TCP_DEFER_ACCEPT TCP option
(this is disabling the DoS shield!)
-d: daemon mode (uses an 'angel' process)
-d:group:user dumps 'root' privileges
-g: do not limit worker threads to Cores
(using more threads than physical Cores
will lower speed and waste resources)
-k: (gracefully) kill local gwan processes
using the pid files of the gwan folder
-r: run the specified C script (and exit)
for general-purpose code, not servlets
-t: store client requests in 'trace' file
(may be useful to debug/trace attacks)
-v: return version number and build date
-w: define the number of worker threads.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Available network interfaces (2):
127.0.0.1 192.168.0.11

Related

How should I limit the slow disconnect of thousands of tcp connections?

I have a golang program running on CentOS that usually has around 5k TCP clients connected. Every once in a while this number goes to around 15k for about an hour, and everything is still fine.
The program has a slow shutdown mode where it stops accepting new clients and slowly kills all currently connected clients over the course of 20 mins. During these slow shutdown periods, if the machine has 15k clients, I sometimes get:
[Wed Oct 31 21:28:23 2018] net_ratelimit: 482 callbacks suppressed
[Wed Oct 31 21:28:23 2018] TCP: too many orphaned sockets
[Wed Oct 31 21:28:23 2018] TCP: too many orphaned sockets
[Wed Oct 31 21:28:23 2018] TCP: too many orphaned sockets
I have tried adding:
echo "net.ipv4.tcp_max_syn_backlog=5000" >> /etc/sysctl.conf
echo "net.ipv4.tcp_fin_timeout=10" >> /etc/sysctl.conf
echo "net.ipv4.tcp_tw_recycle=1" >> /etc/sysctl.conf
echo "net.ipv4.tcp_tw_reuse=1" >> /etc/sysctl.conf
sysctl -f /etc/sysctl.conf
These values are set; I can see them with their correct new values. A typical sockstat is:
cat /proc/net/sockstat
sockets: used 31682
TCP: inuse 17286 orphan 5 tw 3874 alloc 31453 mem 15731
UDP: inuse 8 mem 3
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
Any ideas how to stop the too many orphaned sockets error and crash? Should I increase the 20 min slow shutdown period to 40 mins? Increase tcp_mem? Thanks!
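For reference, this kernel message is tied to the net.ipv4.tcp_max_orphans limit (and to overall TCP memory pressure); the headroom can be checked, and the limit raised if needed, like this (65536 is only an illustrative value):
# Current orphan limit and live orphan count
sysctl net.ipv4.tcp_max_orphans
grep orphan /proc/net/sockstat
# Raise the limit if the shutdown burst legitimately needs it
echo "net.ipv4.tcp_max_orphans=65536" >> /etc/sysctl.conf
sysctl -f /etc/sysctl.conf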

LSF job states for a given job

Let's say my job was running for some time, was suspended due to machine overloading, became running again after some time, and finally completed.
So the states this job went through were RUNNING -> SUSPENDED -> RUNNING.
How can I get all the states a given job went through?
bjobs -l if the job hasn't been cleaned from the system yet.
bhist -l otherwise. You might need -n, depending on how old the job is.
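For instance, to search all rotated event log files rather than only the current one (using job 1168 from the example below):
# -n 0 tells bhist to search all event log files
bhist -n 0 -l 1168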
Here's an example of bhist -l output when a job was suspended and later resumed because the system load temporarily exceeded the configured threshold.
$ bhist -l 1168
Job <1168>, User <mclosson>, Project <default>, Command <sleep 10000>
Fri Jan 20 15:08:40: Submitted from host <hostA>, to
Queue <normal>, CWD <$HOME>, Specified Hosts <hostA>;
Fri Jan 20 15:08:41: Dispatched 1 Task(s) on Host(s) <hostA>, Allocated 1 Slot(
s) on Host(s) <hostA>, Effective RES_REQ <select[type == any] or
der[r15s:pg] >;
Fri Jan 20 15:08:41: Starting (Pid 30234);
Fri Jan 20 15:08:41: Running with execution home </home/mclosson>, Execution CW
D </home/mclosson>, Execution Pid <30234>;
Fri Jan 20 16:19:22: Suspended: Host load exceeded threshold: 1-minute CPU ru
n queue length (r1m)
Fri Jan 20 16:21:43: Running;
Summary of time in seconds spent in various states by Fri Jan 20 16:22:09
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
1 0 4267 0 141 0 4409
At 16:19:22 the job was suspended because r1m exceeded the threshold. Later, at 16:21:43, the job resumed.

WAL contains references to invalid pages

CentOS 6.7
PostgreSQL 9.5.3
I have DB servers in master-standby replication.
Suddenly, the standby server's postgresql process stopped with these logs:
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]WARNING: page 1671400 of relation base/16400/559613 is uninitialized
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]CONTEXT: xlog redo Heap2/VISIBLE: cutoff xid 1902107520
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]PANIC: WAL contains references to invalid pages
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]CONTEXT: xlog redo Heap2/VISIBLE: cutoff xid 1902107520
2016-07-14 18:14:21.026 JST [][5783e038.3cd9][0][15577]LOG: startup process (PID 15579) was terminated by signal 6: Aborted
2016-07-14 18:14:21.026 JST [][5783e038.3cd9][0][15577]LOG: terminating any other active server processes
The master server's postgresql logs showed nothing special.
But the master server's /var/log/messages contained the entries below.
Jul 14 05:38:44 host kernel: sbridge: HANDLING MCE MEMORY ERROR
Jul 14 05:38:44 host kernel: CPU 8: Machine Check Exception: 0 Bank 9: 8c000040000800c0
Jul 14 05:38:44 host kernel: TSC 0 ADDR 1f7dad7000 MISC 90004000400008c PROCESSOR 0:306e4 TIME 1468442324 SOCKET 1 APIC 20
Jul 14 05:38:44 host kernel: EDAC MC1: CE row 1, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#1": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=8 Err=0008:00c0 (ch=0), addr = 0x1f7dad7000 => socket=1, Channel=0(mask=1), rank=4
Jul 14 05:38:44 host kernel:
Jul 14 18:30:40 host kernel: sbridge: HANDLING MCE MEMORY ERROR
Jul 14 18:30:40 host kernel: CPU 8: Machine Check Exception: 0 Bank 9: 8c000040000800c0
Jul 14 18:30:40 host kernel: TSC 0 ADDR 1f7dad7000 MISC 90004000400008c PROCESSOR 0:306e4 TIME 1468488640 SOCKET 1 APIC 20
Jul 14 18:30:41 host kernel: EDAC MC1: CE row 1, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#1": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=8 Err=0008:00c0 (ch=0), addr = 0x1f7dad7000 => socket=1, Channel=0(mask=1), rank=4
Jul 14 18:30:41 host kernel:
The memory errors started about a week ago, so I suspect the memory error is the cause of postgresql's error.
My questions are:
1) Can a memory error like this cause postgresql's "WAL contains references to invalid pages" error?
2) Why are there no relevant logs in the master server's postgresql?
Thanks!
Faulty memory can cause all kinds of data corruption, so that seems like a good enough explanation to me.
Perhaps there are no log entries at the master PostgreSQL server because all that was corrupted was the WAL stream.
You can run
oid2name
to find out which database has OID 16400 and then
oid2name -d <database with OID 16400> -f 559613
to find out which table belongs to file 559613.
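For example, if oid2name reports that OID 16400 belongs to a database named mydb (a hypothetical name):
oid2name                      # lists all databases with their OIDs
oid2name -d mydb -f 559613    # prints the table behind filenode 559613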
Is that table larger than 12 GB (page 1671400 times the default 8 kB page size is roughly 12.7 GiB)? If not, that would mean that page 1671400 is indeed an invalid value.
You are on PostgreSQL 9.5.3; there may be replication bugs fixed in later releases that could cause replication problems even without a hardware fault, so read the release notes.
I would perform a new pg_basebackup and reinitialize the slave system.
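A sketch of that reinitialization (master host name, replication user, and data directory are assumptions, adjust them to your setup):
# On the standby, after stopping postgresql and moving the old data directory away
pg_basebackup -h master.example.com -U repl -D /var/lib/pgsql/9.5/data -X stream -P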
But what I'd really be worried about is possible data corruption on the master server. Block checksums are cool (turned on if pg_controldata <data directory> | grep checksum gives you 1), but possibly won't detect the effects of memory corruption.
Try something like
pg_dumpall -f /dev/null
on the master and see if there are errors.
Keep your old backups in case you need to repair something!

How to run puppetmaster using Apache/Passenger

Running Puppet v2.7.14 on CentOS 6, using Apache/Passenger instead of WEBrick. I was told that the puppetmaster service is not required to be running (hence: chkconfig puppetmaster off) when using httpd and passenger, but in my case, if I don't start puppetmasterd manually, none of the agents can connect to the master. I can start httpd just fine and passenger seems to start okay as well. This is my apache configuration file:
# /etc/httpd/conf.d/passenger.conf
LoadModule passenger_module modules/mod_passenger.so
<IfModule mod_passenger.c>
PassengerRoot /usr/lib/ruby/gems/1.8/gems/passenger-3.0.12
PassengerRuby /usr/bin/ruby
#PassengerTempDir /var/run/rubygem-passenger
PassengerHighPerformance on
PassengerUseGlobalQueue on
PassengerMaxPoolSize 15
PassengerPoolIdleTime 150
PassengerMaxRequests 10000
PassengerStatThrottleRate 120
RackAutoDetect Off
RailsAutoDetect Off
</IfModule>
Upon restart, I see these in the httpd_error log:
[Sat Jun 09 04:06:47 2012] [notice] caught SIGTERM, shutting down
[Sat Jun 09 09:06:51 2012] [notice] suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Sat Jun 09 09:06:51 2012] [notice] Digest: generating secret for digest authentication ...
[Sat Jun 09 09:06:51 2012] [notice] Digest: done
[Sat Jun 09 09:06:51 2012] [notice] Apache/2.2.15 (Unix) DAV/2 Phusion_Passenger/3.0.12 mod_ssl/2.2.15 OpenSSL/1.0.0-fips configured -- resuming normal operations
And passenger-status prints this info on the screen:
----------- General information -----------
max = 15
count = 0
active = 0
inactive = 0
Waiting on global queue: 0
----------- Application groups -----------
But still, as I said, none of my agents can actually talk to the master until I start puppetmasterd manually. Does anyone know what I am still missing? Or is this the way it's supposed to be? Cheers!!
It sounds like you may be missing an /etc/httpd/conf.d/puppetmaster.conf file that's based on https://github.com/puppetlabs/puppet/blob/master/ext/rack/files/apache2.conf
Without something like this, you're missing the glue that maps port 8140 to the rack-based puppetmasterd.
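A minimal sketch of such a vhost, modeled on that template (certificate paths and host names are assumptions, adjust them to your site):
# /etc/httpd/conf.d/puppetmaster.conf (sketch)
Listen 8140
<VirtualHost *:8140>
    SSLEngine on
    SSLCertificateFile    /var/lib/puppet/ssl/certs/master.example.com.pem
    SSLCertificateKeyFile /var/lib/puppet/ssl/private_keys/master.example.com.pem
    SSLCACertificateFile  /var/lib/puppet/ssl/certs/ca.pem
    SSLVerifyClient optional
    SSLOptions +StdEnvVars
    # Hand the client certificate details to puppetmasterd
    RequestHeader set X-Client-DN %{SSL_CLIENT_S_DN}e
    RequestHeader set X-Client-Verify %{SSL_CLIENT_VERIFY}e
    DocumentRoot /usr/share/puppet/rack/puppetmasterd/public/
    RackBaseURI /
</VirtualHost>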
See http://docs.puppetlabs.com/guides/passenger.html
https://github.com/puppetlabs/puppet/tree/master/ext/rack
http://www.modrails.com/documentation/Users%20guide%20Apache.html#_deploying_a_rack_based_ruby_application_including_rails_gt_3
After a few days of banging my head, it's now running. The main problem was the port number - the puppetmaster was running on a different port than the one the puppet agents were trying to communicate on.
Another thing: RackAutoDetect On must be processed before the dashboard vhost file, so I renamed the passenger vhost file to 00_passenger.conf to make sure it loads first. After that, puppetmaster runs under Apache/Passenger. Cheers!!

sphinx performance after idle

I'm writing a web application for which I'm using Sphinx to search around a million documents.
The performance is excellent, with a typical query taking just 0.05 seconds, but if no queries are made for a few hours, it suddenly takes much longer - up to 1000x longer for a couple of queries, then returns to normal. The query log looks like this:
[Wed Mar 7 17:23:55.937 2012] 0.221 sec
[Wed Mar 7 17:32:00.726 2012] 0.012 sec
[Wed Mar 7 17:32:00.984 2012] 0.052 sec
[Wed Mar 7 17:32:01.416 2012] 0.222 sec
[Thu Mar 8 09:15:10.418 2012] 10.147 sec
[Thu Mar 8 09:16:00.560 2012] 48.262 sec
[Thu Mar 8 09:16:55.429 2012] 54.153 sec
[Thu Mar 8 09:17:54.454 2012] 0.012 sec
[Thu Mar 8 09:17:54.713 2012] 0.052 sec
[Thu Mar 8 09:17:55.141 2012] 0.218 sec
I'm guessing my busy server is swapping out Sphinx's memory while it sits unused, or something like that.
How can I resolve this?
I considered scripting fake queries every minute but that seems like quite an ugly hack.
How fast are the disks on this server?
I imagine this is due to having to go back to disk for the indexes. Once accessed a few times, the OS will have cached the files.
It might be worth considering an SSD. A small SSD - big enough for Sphinx's indexes - is relatively cheap nowadays.
If it really is memory swapping, that also suggests you have slow disks, but that is also something to address. Can you add more memory to the server? (Or even put the swap partition on the newly installed SSD :)
By the way, find out whether swapping is actually happening with something like Munin (or Cacti, etc.).
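For a quick one-off check and a cache warm-up, something like this works (the index path is an assumption):
# Watch swap-in/swap-out (the si/so columns); sustained nonzero values mean swapping
vmstat 5
# Pre-read the index files so the OS page cache is warm again
cat /var/lib/sphinx/*.sp* > /dev/null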