PlayFramework Hangs After Days - scala

The server runs successfully at first, but after a few days it hangs with no error logs, and from then on no request gets a response.
This is the start command with its options:
sudo /opt/dev -Dhttps.port=443 -Dhttp.port=9000 -J-Xms3277m -J-Xmx3277m -J-XX:ParallelGCThreads=2 -J-Xmn2574M -J-XX:+UseConcMarkSweepGC -J-XX:+CMSClassUnloadingEnabled -J-server &
/opt/dev is the script file generated by activator stage.
===========server info==========
linux: Ubuntu 14.04.5 LTS
ram: 4G
openjdk version "1.8.0_141"
===========process info========
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME COMMAND
15037 root 20 0 5978800 2.280g 31216 S 0.0 58.3 63:33.82 java
===========port info ===================
tcp6 :::9000 :::* LISTEN 15037/java
tcp6 :::443 :::* LISTEN 15037/java
===========other info==========
play version 2.3.2
scala version 2.11.1
akka settings
akka.jvm-exit-on-fatal-error = false
play.akka.jvm-exit-on-fatal-error = false
akka.default-dispatcher.fork-join-executor.pool-size-max =64
akka.actor.debug.receive = on
===========================================

These steps could help identify the problem, or at least serve as first steps in that direction.
Start by adding -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/where/to/put/hprof; judging from your start-script parameters, you probably need the -J-XX form instead of -XX (see the example after this list). This will create a heap dump in case of an OOM.
Add logging in your endpoints (at the start and at the end) so you can tell whether Play even receives the requests.
While Play is unresponsive, check the number of open file descriptors and compare it with your limits. Find the pid of your java process and call sudo ls -al /proc/7333/fd/ | wc -l; to see your limits, use ulimit -a.
It would also be worth keeping an eye on the Akka queues, in case you use the same dispatcher both for frontend requests and for back-office processing (the dispatcher could be filled up with long-running background tasks).
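For reference, a minimal sketch of the adjusted start command, reusing the options from the question and adding the heap-dump flags in the -J form (the dump path is the same placeholder as above), plus the file-descriptor check against the pid from the process table in the question:
sudo /opt/dev -Dhttps.port=443 -Dhttp.port=9000 -J-Xms3277m -J-Xmx3277m -J-XX:ParallelGCThreads=2 -J-Xmn2574M -J-XX:+UseConcMarkSweepGC -J-XX:+CMSClassUnloadingEnabled -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:HeapDumpPath=/where/to/put/hprof -J-server &
# while the app is unresponsive, count the open file descriptors of the java process
sudo ls -al /proc/15037/fd/ | wc -l
# compare against the limits; /proc/<pid>/limits shows the limits of the running process itself
ulimit -a
cat /proc/15037/limits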

I would do all the diagnostic steps that Evgeny suggested, plus:
Change "akka.jvm-exit-on-fatal-error" and "play.akka.jvm-exit-on-fatal-error" to true; setting them to false may be masking your problem.
Take a stack dump of the running process while it is in this state and use it to identify the problem, or post it here (a sketch of how to capture one follows). See "How to get a complete stack trace of a running java program that is taking 100% cpu?"
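A minimal sketch, assuming the pid from the process table above and a JDK install that ships jstack:
sudo jstack -l 15037 > /tmp/play-thread-dump.txt
# or ask the JVM itself to print a thread dump to its stdout/log
sudo kill -3 15037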

Enable FTRACE in ARM to trace realtime characteristic of system

I followed this article to enable FTRACE in order to test a realtime system:
https://lwn.net/Articles/365835/
My system uses an ARM Cortex-A15 (description: https://mp.renesas.com/en-us/rzg/marketplace/board/RZGB000003.html). These are the options I enabled:
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_STACK_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
But it didn't work: it caused the system to hang while the kernel was starting.
I also referred to "How to Enable or configure ftrace module".
I would like to test latency in the realtime system with cyclictest (the -b option triggers the tracer):
cyclictest -a -t -n -p99 -f -b100
It produced this output:
INFO: debugfs mountpoint: /sys/kernel/debug/tracing/
WARN: tracing_enabled or tracing_on not found
debug fs not mounted, TRACERs not configured?
could not set ftrace_enabled to 0
FATAL: Can't open /sys/kernel/debug/tracing/available_tracers for reading
I then repeated the attempt with a larger group of tracer configs enabled:
CONFIG_FTRACE=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FUNCTION_TRACER=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_PREEMPT_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_ENABLE_DEFAULT_TRACERS=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_TRACER_SNAPSHOT=y
CONFIG_STACK_TRACER=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_TRACEPOINT_BENCHMARK=y
CONFIG_BACKTRACE_SELF_TEST=y
CONFIG_EARLY_PRINTK=y
CONFIG_DEBUG_LL=y
The result was still the same: the kernel hung and didn't show anything.
Can anyone who has dealt with realtime systems and Ftrace help? Thanks.
I resolved my problem. Below is the relevant part of my defconfig file:
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_TRACEPOINTS=y
CONFIG_STACKTRACE=y
CONFIG_NOP_TRACER=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_TRACE_CLOCK=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_GENERIC_TRACER=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_TRACER_SNAPSHOT=y
CONFIG_STACK_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE_MCOUNT_RECORD=y
After enabling the Ftrace tool, the culprit can be found in the trace output at /sys/kernel/debug/tracing/trace: the kernel function that executed just before a latency of more than 100 microseconds was detected is marked with an exclamation mark.
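For reference, a rough sketch of that workflow once the kernel has these options built in (the mount step is only needed if debugfs is not already mounted; the cyclictest options are the ones from the question):
mount -t debugfs nodev /sys/kernel/debug
# -b100 stops the run and the tracer as soon as a latency above 100 us is seen
cyclictest -a -t -n -p99 -f -b100
# the function executed just before the latency is the one marked with '!'
cat /sys/kernel/debug/tracing/trace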

Mappers fail for pig to insert data into MongoDB

I am trying to import a file from HDFS to MongoDB using MongoInsertStorage with PIG. The files are large, around 5GB. The script runs fine when I run it in local mode with
pig -x local example.pig
However, if I run it in mapreduce mode, most of the mappers fail with the following error:
Error: com.mongodb.ConnectionString.getReadConcern()Lcom/mongodb/ReadConcern;
Container killed by the ApplicationMaster.
Container killed on request.
Exit code is 143 Container exited with a non-zero exit code 143
Can someone help me solve this issue? I also increased the memory allocated to the YARN containers, but that hasn't helped.
Some mappers are also timing out after 300 seconds.
The Pig script is as follows:
REGISTER mongo-java-driver-3.2.2.jar
REGISTER mongo-hadoop-core-1.4.0.jar
REGISTER mongo-hadoop-pig-1.4.0.jar
REGISTER mongodb-driver-3.2.2.jar
DEFINE MongoInsertStorage com.mongodb.hadoop.pig.MongoInsertStorage();
SET mapreduce.reduce.speculative true
BIG_DATA = LOAD 'hdfs://example.com:8020/user/someuser/sample.csv' using PigStorage(',') As (a:chararray,b:chararray,c:chararray);
STORE BIG_DATA INTO 'mongodb://insert.some.ip.here:27017/test.samplecollection' USING MongoInsertStorage('', '')
I found a solution.
For the error
Error: com.mongodb.ConnectionString.getReadConcern()Lcom/mongodb/ReadConcern;
Container killed by the ApplicationMaster.
Container killed on request.
Exit code is 143 Container exited with a non-zero exit code 143
I changed the JAR versions: mongo-hadoop-core and mongo-hadoop-pig from 1.4.0 to 2.0.2, and the MongoDB Java driver from 3.2.2 to 3.4.2. This eliminated the ReadConcern error on the mappers!
For the timeout, I added this after registering the jars:
SET mapreduce.task.timeout 1800000
I had been using SET mapred.task.timeout, which didn't work.
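Putting it together, a sketch of how the top of the script looks after those changes (the exact jar filenames are assumptions that depend on how you obtained the artifacts):
REGISTER mongo-java-driver-3.4.2.jar
REGISTER mongo-hadoop-core-2.0.2.jar
REGISTER mongo-hadoop-pig-2.0.2.jar
-- the original script also registered mongodb-driver-3.2.2.jar; keep its version in sync if you still need it
DEFINE MongoInsertStorage com.mongodb.hadoop.pig.MongoInsertStorage();
SET mapreduce.reduce.speculative true
-- 1800000 ms = 30 minutes
SET mapreduce.task.timeout 1800000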
Hope this helps anyone who has a similar issue!

Solaris svcs command shows wrong status

I have freshly installed an application on Solaris 5.10. When I check through ps -ef | grep hyperic | grep agent, the processes are up and running. When I check the status with the svcs hyperic-agent command, the output shows that the agent is in maintenance mode. The application is working fine and I don't have any issues with it. Please help.
There are several reasons that can lead to this behavior:
The starter (the start/exec method of the service) returned a status different from SMF_EXIT_OK (zero). In that case, check the logs:
# svcs -x ssh
...
See: /var/svc/log/network-ssh:default.log
If you check the logs, you may see messages like the following, which mean the start method failed or is written incorrectly:
[ Aug 11 18:40:30 Method "start" exited with status 96 ]
Another reason for this behavior is that the service faults while it is running (i.e. one of its processes dumps core, receives a kill signal, or all of its processes exit), as described here: https://blogs.oracle.com/lianep/entry/smf_5_fault_retry_models
The facility that actually provides this monitoring for SMF is System Contracts. You can determine the contract ID of an online service with svcs -v (the CTID field):
# svcs -vp svc:/network/smtp:sendmail
STATE NSTATE STIME CTID FMRI
online - Apr_14 68 svc:/network/smtp:sendmail
Apr_14 1679 sendmail
Apr_14 1681 sendmail
Then watch its events with ctwatch:
# ctwatch 68
CTID EVID CRIT ACK CTTYPE SUMMARY
68 28 crit no process contract empty
From there, there are two ways to handle it:
There is a real problem with the service, so it eventually faults. In that case, debug the application.
This is normal behavior for the service, in which case you should edit and re-import your service manifest to make SMF less strict, i.e. configure the ignore_error and duration properties (a sketch follows below).
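A minimal sketch of the second option using svccfg, assuming the FMRI is hyperic-agent as in the question; the property values shown are examples, so adapt them to what your service actually needs:
# create the startd property group first if the manifest does not already define one
svccfg -s hyperic-agent addpg startd framework
# ignore core dumps and external kill signals instead of treating them as faults
svccfg -s hyperic-agent setprop startd/ignore_error = astring: "core,signal"
# or declare the service transient so SMF does not watch its processes at all
svccfg -s hyperic-agent setprop startd/duration = astring: transient
svcadm refresh hyperic-agent
# once the underlying cause is fixed, clear the maintenance state
svcadm clear hyperic-agent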

Custom Munin plugin won't report

I've built my first Munin plugin, to give us the size of our Redis queue, but it won't report for some reason. Every other plugin on the node, including other Redis-centric plugins, works fine.
Here's the plugin code:
#!/bin/sh
case $1 in
config)
cat <<'EOM'
multigraph redis_queue_size
graph_title Redis Queue Size
graph_info The size of Redis queue
graph_category redis
graph_vlabel Messages
redisqueue.label redisqueue
redisqueue.type GAUGE
redisqueue.min 0
EOM
exit 0;;
esac
queuelength=`redis-cli llen mykeyname`
printf "redisqueue.value "
echo $queuelength
The plugin is in /usr/share/munin/plugins/redis_queue_
The plugin is symlinked to /etc/munin/plugins/redis_queue_
I made sure to restart the service
$ sudo service munin-node force-reload
If I run sudo munin-run redis_queue_ I get the correct output:
redisqueue.value 1567595
If I run munin-node-config I get the following:
redis_queue_ | yes |
If I connect to the instance from the master using telnet to fetch the plugin, I get:
$ telnet 10.101.21.56 4949
Trying 10.101.21.56...
Connected to 10.101.21.56.
Escape character is '^]'.
# munin node at redis01.example.com
fetch redis_queue_
redisqueue.value 1035336
The master shows an empty graph for it, but the "last updated" time isn't increasing. I initially had the plugin configured a little differently (it wasn't producing good output) so all the values are -nan. Once I fixed the output, I expected the plugin to start working, but all efforts have failed.
Everything looks right, but yet still no values in the graph.
Edit: Munin v1.4.6

python-memcache memcached -- I installed on centos virtualbox but it get/set never seem to work

I'm using Python. I did a yum install memcached followed by an easy_install python-memcached.
I used the simple test program from help(memcache). When I wasn't getting the proper answers, I threw in some print statements:
[~/test]$ cat m2.py
import memcache
mc = memcache.Client(['127.0.0.1:11211'], debug=0)
x = mc.set("some_key", "Some value")
print 'Just set a key and value into the cache (suposedly)'
value = mc.get("some_key")
print 'Just retrieved that value from the cache using the key'
print 'X %s' % x
print 'Value %s' % value
[~/test]$ python m2.py
Just set a key and value into the cache (suposedly)
Just retrieved that value from the cache using the key
X 0
Value None
[~/test]$
The question now is: what have I failed to do in my installation? It appears to be working from an API perspective, but it fails to put anything into the shared memcached area.
I'm using a VirtualBox VM running CentOS:
[~]# cat /proc/version
Linux version 2.6.32-358.6.2.el6.i686 (mockbuild@c6b8.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Thu May 16 18:12:13 UTC 2013
Is there a daemon that is supposed to be running? I don't see an obviously named one when I do a ps.
I tried to get pylibmc installed on my VM but was unable to find a working installation, so for now I'll see if I can get the above working first.
I discovered that if I run it straight from the Python console I get a bit more output when I set debug=1:
>>> mc = memcache.Client(['127.0.0.1:11211'], debug=1)
>>> mc.stats
{}
>>> mc.set('test','value')
MemCached: MemCache: inet:127.0.0.1:11211: connect: Connection refused. Marking dead.
0
>>> mc.get('test')
MemCached: MemCache: inet:127.0.0.1:11211: connect: Connection refused. Marking dead.
When I try to telnet to the port, as in the example, I get a connection refused:
[root@~]# telnet 127.0.0.1 11211
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
[root@~]#
I tried the instructions I found on the net for configuring telnet so that localhost wouldn't be disabled:
vi /etc/xinetd.d/telnet
service telnet
{
flags = REUSE
socket_type = stream
wait = no
user = root
server = /usr/sbin/in.telnetd
log_on_failure += USERID
disable = no
}
And then ran the commands to restart the service(s):
service iptables stop
service xinetd stop
service iptables start
service xinetd start
service iptables stop
I ran it with both cases (iptables started and stopped), but it had no effect, so I am out of ideas. What do I need to do to make the port allowed, if that is the problem?
Or is there a memcached service that needs to be running in order to open up the port?
Well, this is what it took to get it working (a series of manual steps):
1) su -
cd /var/run
mkdir memcached # this was missing
In the memcached file I added "-l 127.0.0.1" to the OPTIONS statement; it's the listen-address option. Do this for steps 2 & 3 (I'm not certain which of the two files is actually used at runtime); a sketch of the resulting file is at the end of these steps.
2) cd /etc/sysconfig
cp memcached memcached.old
vi memcached
3) cd /etc/init.d
cp memcached memcached.old
vi memcached
4) Try some commands to see if the server starts now
/etc/init.d/memcached start
/etc/init.d/memcached status
/etc/init.d/memcached stop
/etc/init.d/memcached restart
I tried opening a browser against it, but it never seemed to actually display anything, so I don't really know how valid this approach is. I'm not running Apache or anything like that, so perhaps it's not relevant in my case. Perhaps I would have to supply a ?key=blah or something.
5) http://127.0.0.1:11211
6) Now it should be ready to go. If you run the test shown below, it should work; at least it did for me. Running help(memcache) displays a simple example program; just paste that in and it should work just fine.
[~]$ python
>>> import memcache
>>> help(memcache)
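For reference, a sketch of what the config and the final checks looked like in my case. The OPTIONS line is the change described in the steps above; the other values are the stock CentOS defaults, so treat them as assumptions:
# /etc/sysconfig/memcached
PORT="11211"
USER="memcached"
MAXCONN="1024"
CACHESIZE="64"
OPTIONS="-l 127.0.0.1"
# restart the daemon and confirm it is listening before re-running the Python test
service memcached restart
netstat -tlnp | grep 11211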