NFS mount points going offline / NFS compound failed for server mashost - Solaris

We have an application on Solaris. During a specific test case we generate a heap dump, which is written to the server at a specific path. While this happens we get the following error in the trace file:
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /ossrc/upgrade/JREheapdumps/java_pid16092.hprof ...
Dump file is incomplete: I/O error
and in /var/adm/messages we could see
Oct 28 13:00:10 ossuas2 nfs: [ID 733954 kern.info] NOTICE: [NFS4][Server: mashost][Mntpt: /ossrc/upgrade]NFS server mashost not
responding; still trying
Oct 28 13:02:53 ossuas2 nfs: [ID 733954 kern.info] NOTICE: [NFS4][Server: mashost][Mntpt: /usr/local]NFS server mashost not
responding; still trying
Oct 28 13:04:53 ossuas2 nfs: [ID 733954 kern.info] NOTICE: [NFS4][Server: mashost][Mntpt: /etc/opt/ericsson]NFS server mashost not
responding; still trying
Can anyone please explain why we are getting this problem, and can an application cause this kind of impact on mashost?

First things first, check the NFS services with svcs. When it crashes, run:
# svcs -x nfs/client
on the client, and
# svcs -x nfs/server
on the server. I would expect one or both to be in the "maintenance" state (you may also see that it fails to start properly at all). If it is in maintenance mode, you should see a line marked "Reason:" that explains why.
You might see "offline" instead -- in that case, svc.startd will attempt to restart the service multiple times and, if it fails after five attempts or hangs indefinitely, will place it into the "maintenance" state and stop restarting it.
Check the logs in
/var/svc/log/<service-name FMRI>.log
There will be one on your client machine under "network-nfs-client:default" (the instance may have a name other than 'default' if it has been changed manually), and one on the server under "network-nfs-server:default".
See what you can glean from those.
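For example, a quick way to eyeball the most recent entries on the client (the exact file name may differ slightly on your machine; the server-side log works the same way):
# tail -50 /var/svc/log/network-nfs-client:default.log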
SMF automatically keeps snapshots of service configurations as backups, so you can try reverting to one of those with svccfg:
# svccfg -s nfs/server:default
svc:/network/nfs/server:default> listsnap
svc:/network/nfs/server:default> revert start [name_of_snapshot]
svc:/network/nfs/server:default> quit
# svcadm refresh nfs/server:default
# svcadm restart nfs/server:default
Make sure to include the ":default" tag (or whatever instance name "svcs nfs/server" showed); that name identifies an instance of the service, and every running service is an instance.
If the service is failing to start, you might have to look at the XML manifest under /lib/svc/manifest/network/nfs/ -- inside, you'll see its dependencies (and the services that depend on it), followed by the "exec_method" elements, which define how the service starts, stops, and restarts.
Instead of snapshots, you can also restore it to its defaults: use svccfg -s <FMRI> delete to clear it, then svcadm refresh <FMRI> and svcadm enable <FMRI>.
If the service is in maintenance state, once you've isolated and fixed the problem, you can manually clear that state by running svcadm clear <FMRI>.
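For instance, if svcs -x pointed at the client instance, clearing it would look like this (substitute whichever FMRI was actually reported as faulted):
# svcadm clear svc:/network/nfs/client:default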

Related

Kubernetes DSE Cassandra CommitLogReplayer$CommitLogReplayException

I have installed Cassandra on Kubernetes (9 pods). All the pods are up and running except for one pod, which shows the error below.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Encountered bad header at position 47137 of commit log /var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log, with bad position but valid CRC
at org.apache.cassandra.db.commitlog.CommitLogReplayer.shouldSkipSegmentOnError(CommitLogReplayer.java:438)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleUnrecoverableError(CommitLogReplayer.java:452)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:109)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:84)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:236)
at org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:134)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:154)
at org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:213)
at org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:194)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:338)
at com.datastax.bdp.server.DseDaemon.setup(DseDaemon.java:527)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:702)
at com.datastax.bdp.DseModule.main(DseModule.java:96)
Caused by: org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Encountered bad header at position 47137 of commit log /var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log, with bad position but valid CRC
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:111)
... 12 more
ERROR [main] 2021-09-06 06:19:08,990 JVMStabilityInspector.java:251 - JVM state determined to be unstable. Exiting forcefully due to:
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Encountered bad header at position 47137 of commit log /var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log, with bad position but valid CRC
at org.apache.cassandra.db.commitlog.CommitLogReplayer.shouldSkipSegmentOnError(CommitLogReplayer.java:438)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleUnrecoverableError(CommitLogReplayer.java:452)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:109)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:84)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:236)
at org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:134)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:154)
at org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:213)
at org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:194)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:338)
at com.datastax.bdp.server.DseDaemon.setup(DseDaemon.java:527)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:702)
at com.datastax.bdp.DseModule.main(DseModule.java:96)
Can someone help me out, please?
For whatever reason, one of the commit log segments got corrupted on the node.
You can work around the issue by manually deleting this file on the pod:
/var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log
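If you can still exec into the container, a sketch of that workaround with kubectl (the pod name cassandra-0 is just a placeholder; if the container crash-loops too quickly to exec into, remove the file via the underlying volume instead):
kubectl exec cassandra-0 -- rm /var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log
kubectl delete pod cassandra-0
Deleting the pod simply forces a restart so Cassandra can replay the remaining commit log segments.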
Interestingly, that commit log segment was created on September 2 (the 1630582314923 in the file name is a millisecond epoch timestamp), but the log entry you posted is from September 6. This indicates something happened to the pod which resulted in the corrupted file.
You'll need to review the Cassandra logs on the pod (not the pod's own logs) to determine the root cause and address it. Cheers!
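A sketch of pulling those logs, assuming the standard DSE image layout and the same placeholder pod name:
kubectl exec cassandra-0 -- tail -200 /var/log/cassandra/system.log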

Solaris svcs command shows wrong status

I have freshly installed an application on Solaris 5.10. When checked through ps -ef | grep hyperic | grep agent, the processes are up and running. When I check the status with the svcs hyperic-agent command, the output shows that the agent is in maintenance mode. The application is working fine and I don't have any issues with it. Please help.
There are several reasons that can lead to that behavior:
The starter (the start/exec method of the service) returned a status different from SMF_EXIT_OK (zero). Then you may check the logs:
# svcs -x ssh
...
See: /var/svc/log/network-ssh:default.log
If you check the logs, you may see messages like the following, which mean the start script failed or is incorrectly written:
[ Aug 11 18:40:30 Method "start" exited with status 96 ]
Another reason for such behavior is that the service faults while it is working (i.e. one of its processes core dumps, receives a kill signal, or all of its processes exit), as described here: https://blogs.oracle.com/lianep/entry/smf_5_fault_retry_models
The underlying system that provides SMF with these monitoring facilities is System Contracts. You can determine the contract ID of an online service with svcs -v (the CTID field):
# svcs -vp svc:/network/smtp:sendmail
STATE NSTATE STIME CTID FMRI
online - Apr_14 68 svc:/network/smtp:sendmail
Apr_14 1679 sendmail
Apr_14 1681 sendmail
Then watch its events with ctwatch:
# ctwatch 68
CTID EVID CRIT ACK CTTYPE SUMMARY
68 28 crit no process contract empty
Then there are two options to handle that:
There is a real problem with the service, so it eventually faults. Then debug the application.
It is normal behavior for the service, so you should edit and re-import your service manifest to make SMF less paranoid, i.e. configure the ignore_error and duration properties.
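Instead of re-importing the manifest, the same properties can also be set directly with svccfg. A minimal sketch, assuming a hypothetical FMRI of svc:/application/hyperic-agent:default; the addpg command is only needed if the startd property group does not exist yet, and the values shown (ignore core dumps and kill signals, keep the default contract model) are examples to adapt:
# svccfg -s svc:/application/hyperic-agent:default addpg startd framework
# svccfg -s svc:/application/hyperic-agent:default setprop startd/ignore_error = astring: "core,signal"
# svccfg -s svc:/application/hyperic-agent:default setprop startd/duration = astring: contract
# svcadm refresh svc:/application/hyperic-agent:default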

Glassfish4 log rotation "Maximum History Files" issue

Environment:
Glassfish 4.0 (only one DAS), Windows Server 2012 R2, Java 1.7.0_51
Create the DAS instance service by using the create-service subcommand.
Issue:
The maximum history files attribute has been set; however, GlassFish Server could not remove the old log files because of the lock file server.log.lck.
Path --> C:\glassfish4\glassfish\domains\domain1\config\logging.properties
com.sun.enterprise.server.logging.GFFileHandler.maxHistoryFiles=10
Log Snippet:
[2014-12-10T18:00:39.372+0900] [glassfish 4.0] [SEVERE] [] [] [tid: _ThreadID=16 _ThreadName=Thread-5] [timeMillis: 1418202039372] [levelValue: 1000] [[
java.util.logging.ErrorManager: 0: FATAL ERROR: COULD NOT DELETE LOG FILE.]]
[2014-12-10T18:00:39.372+0900] [glassfish 4.0] [SEVERE] [] [] [tid: _ThreadID=16 _ThreadName=Thread-5] [timeMillis: 1418202039372] [levelValue: 1000] [[
java.io.IOException: Could not delete log file: C:\glassfish4\glassfish\domains\domain1\logs\server.log.lck
at com.sun.enterprise.server.logging.GFFileHandler.cleanUpHistoryLogFiles(GFFileHandler.java:725)
at com.sun.enterprise.server.logging.GFFileHandler$4.run(GFFileHandler.java:802)
at java.security.AccessController.doPrivileged(Native Method)
at com.sun.enterprise.server.logging.GFFileHandler.rotate(GFFileHandler.java:744)
at com.sun.enterprise.server.logging.GFFileHandler$1.run(GFFileHandler.java:301)
at com.sun.enterprise.server.logging.LogRotationTimerTask.run(LogRotationTimerTask.java:68)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)]]
Findings:
1. If the lock file "server.log.lck" exists in the log folder, the issue occurs and the above errors appear in the log every day when GlassFish tries to remove the old log files. If there is no "server.log.lck" in the log folder, there is no issue and everything works properly.
2. If the DAS instance is started with the command "asadmin start-domain domain1", no lock file "server.log.lck" is generated in the log folder. But if the DAS instance is started as a Windows Service, the lock file "server.log.lck" is generated automatically and stays at 0 KB; when the service is stopped, the file is removed automatically.
3. If the DAS instance is started with the command "asadmin start-domain -w domain1", which adds the watchdog option, the lock file "server.log.lck" is generated automatically and exists until the service is stopped.
4. When the lock file "server.log.lck" appears, there is always one extra java.exe process. Therefore, when the DAS instance is started as a Windows Service, there are two "java.exe" processes running, and "server.log.lck" is held open by one of them.
Questions:
1. I'd like to start/stop the DAS instance as a Windows Service, not using the subcommand. Moreover, I don't want to keep all GlassFish logs on my server, since that would cause a disk-full issue, so I would prefer to use the GlassFish logging Maximum History Files option. Is there any workaround or solution for that?
2. Is this a defect of GlassFish or just a configuration issue? I did try installing on other servers and they all had the same issue.
3. Why are there two java.exe processes running when started as a Windows Service? Is the second one used for the "watchdog"?
Thank you so much for your help, and please let me know if there is any further information you'd like or any other tests you'd like me to run.
In case someone is still struggling: I found a solution.
When you create a GF service in a Windows environment via asadmin create-service, GF creates a file domain1Service.xml in glassfish\domains\domain1\bin which contains the parameters for starting the server.
It looks something like the following:
<service>
<id>domain1</id>
<name>domain1 GlassFish Server</name>
<description>GlassFish Server</description>
<executable>C:/Supertel-NMSv3/glassfish-4.1/glassfish/lib/nadmin.bat</executable>
<logpath>C:\\Supertel-NMSv3\\glassfish-4.1\\glassfish\\domains/domain1/bin</logpath>
<logmode>reset</logmode>
<depend>tcpip</depend>
<startargument>start-domain</startargument>
<startargument>--watchdog</startargument>
<startargument>--domaindir</startargument>
<startargument>C:\\Supertel-NMSv3\\glassfish-4.1\\glassfish\\domains</startargument>
<startargument>domain1</startargument>
<stopargument>stop-domain</stopargument>
<stopargument>--domaindir</stopargument>
<stopargument>C:\\Supertel-NMSv3\\glassfish-4.1\\glassfish\\domains</stopargument>
<stopargument>domain1</stopargument>
</service>
The line <startargument>--watchdog</startargument> is responsible for launching the watchdog process, which prevents the log file from being deleted.
You can't just delete this startargument element (the service won't start), but you can switch the watchdog off by setting it to false, like this:
<startargument>--watchdog=false</startargument>
After that, the service starts the same way as with a manual start-domain command, without the watchdog process.
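For an already-created service, applying the change is roughly this (a sketch; it assumes the Windows service was registered under the <id> from the XML, domain1 in this case):
net stop domain1
(edit domain1Service.xml, changing <startargument>--watchdog</startargument> to <startargument>--watchdog=false</startargument>)
net start domain1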
You would have to do this after every service creation, which could get pretty annoying, so I did further research.
It turns out that asadmin creates the OS-specific domainService.xml from templates located in glassfish\lib\install\templates. Those templates are also OS-specific, and the template for Windows (named Domain-service-winsw.xml.template) looks like this:
<service>
<id>%%%NAME%%%</id>
<name>%%%DISPLAY_NAME%%%</name>
<description>GlassFish Server</description>
<executable>%%%AS_ADMIN_PATH%%%</executable>
<logpath>%%%LOCATION%%%/%%%ENTITY_NAME%%%/bin</logpath>
<logmode>reset</logmode>
<depend>tcpip</depend>
<startargument>%%%START_COMMAND%%%</startargument>
<startargument>--watchdog</startargument>
%%%CREDENTIALS_START%%%%%%LOCATION_ARGS_START%%%<startargument>%%%ENTITY_NAME%%%</startargument>
<stopargument>%%%STOP_COMMAND%%%</stopargument>
%%%CREDENTIALS_STOP%%%%%%LOCATION_ARGS_STOP%%%<stopargument>%%%ENTITY_NAME%%%</stopargument>
</service>
So you can edit the template directly, setting the parameter to --watchdog=false, and this change will be reflected in every domainService.xml created in the future.
Hope it helps.
That's not the right solution. The watchdog has an important function: it monitors whether the service is running or not. Without the watchdog, GlassFish starts correctly, but shortly afterwards the system no longer knows whether the service is still running or has crashed. In the Services GUI, only the "Start" button is active (always!); "Stop" and "Restart" cannot be used.
A proper solution would be the ability to change the path of the lock file.

Unable to run Mongo shell (Mac)

I'm new to web development and I wanted to get started with some RoR (using Locomotive CMS).
One of the things Locomotive asks for is MongoDB. I installed it using Homebrew by following this link: http://docs.mongodb.org/manual/tutorial/install-mongodb-on-os-x/
It installs fine, but then I'm not able to run it!
When I type 'mongo' in the terminal I get the following output:
"MongoDB shell version: 2.4.3
connecting to: test
Mon May 6 11:12:28.927
JavaScript execution failed:
Error: couldn't connect to server
127.0.0.1:27017 at src/mongo/shell/mongo.js:L112
exception: connect failed"
BACKGROUND TO HELP DEBUGGING (in the terminal):
1. When I type mongod I get the following:
"all output going to: /usr/local/var/log/mongodb/mongo.log"
Ownership of mongo.log:
-rw-r--r-- 1 username admin 22133 May 6 11:13 mongo.log
2. When I run mongod --fork I get the following:
about to fork child process, waiting until server is ready for connections.
forked process: 77566
all output going to: /usr/local/var/log/mongodb/mongo.log
ERROR: child process failed, exited with error number 100
3. Typing mongod --help gives the following warning:
* WARNING: soft rlimits too low. Number of files is 256, should be at least 1000
4. I have a folder called data (which acts as a MongoDB database; is this where it should be?) in root (PATH: /data). Ownership of the data folder:
"drwxr-xr-x 3 username wheel 102 Apr 23 21:38 data"
5. Checking if the port is free: lsof -i :27017. I've also tried to check for a running mongo process using Activity Monitor and found zilch!
No output
6. I've also tried mongo --repair. Didn't help!
I've been stuck on this for a while; I've looked at most responses on Stack Overflow and searched around for a solution, but nothing has helped so far!
UPDATE:
When I tried to start the mongo shell, I was getting the following log message from mongo.log:
5/6/13 1:33:27.616 PM com.apple.launchd:
(org.mongodb.mongod[79133])
open("/private/var/log/mongodb/output.log", ...): Permission denied
So I did a chmod 777 on that particular folder and the shell launches!
Although I still get a warning when it launches:
Server has startup warnings:
Mon May 6 13:33:27.693 [initandlisten]
Mon May 6 13:33:27.693 [initandlisten]
** WARNING: soft rlimits too low.
Number of files is 256, should be at least 1000
Any idea how I can silence these warnings?
To get the information needed to determine the cause of the failure, you need to look at (and post for us) the output in /usr/local/var/log/mongodb/mongo.log from when it is trying to start.
However, the most common reason for the failure is the lack of the default database path - at /data/db. Either create that folder (and don't forget to make sure your user has permission to read/write to it) or specify a different path with the --dbpath option.
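For example (a sketch; adjust ownership to the user that will actually run mongod):
sudo mkdir -p /data/db
sudo chown $(whoami) /data/db
or point mongod at a directory you already own:
mkdir -p ~/mongo-data
mongod --dbpath ~/mongo-data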
UPDATE: as you have since found, bad permissions on the log file can cause the issue, in a similar way to bad permissions on the data path.
In terms of the warning, the information you need is here:
https://superuser.com/questions/433746/is-there-a-fix-for-the-too-many-open-files-in-system-error-on-os-x-10-7-1
It is just that, though: a warning. You can run MongoDB without issue with those limits as long as it is not under heavy load. So if this is a development environment, unless you plan on load testing, you should be fine.
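If the warning bothers you during development, one per-shell way to raise the soft limit before starting the server (assuming the hard limit on your Mac allows it) is:
ulimit -n 1024
mongod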

Issues with running two instances of searchd

I have just updated our Sphinx server from 1.10-beta to 2.0.6-release, and now I have run into some issues with searchd. Previously we were able to run two instances of searchd next to each other by specifying two different config files, i.e.:
searchd --config /etc/sphinx/sphinx.conf
searchd --config /etc/sphinx/sphinx.staging.conf
sphinx.conf listens to 9306:mysql41, and 9312, while sphinx.staging.conf listens to 9307:mysql41 and 9313.
After we updated to 2.0.6, however, a second instance is never started. Or rather, the output makes it seem like it starts, and a pid file is created, etc., but for some reason only the first searchd instance keeps running, and the second seems to shut down right away. So while running searchd --config /etc/sphinx/sphinx.conf a second time (if that was the first one started) complains that the pid file is in use, running searchd --config /etc/sphinx/sphinx.staging.conf (if that was the second instance started) "starts" the daemon again and again, only no new process is ever created.
Note that if I switch these commands around when first starting the daemons, then sphinx.conf is the instance that never really starts.
I have checked, and rechecked, that these ports are only used by searchd.
Does anyone have any idea what I can do or try next? I installed it from source on Ubuntu 10.04 LTS with:
./configure --prefix /etc/sphinx --with-mysql --enable-id64 --with-libstemmer
make -j4 install
Note to self: Check the logs!
RT-indices use binary logs to enable crash recovery. Since my old config files did not specify a path for where these should be stored, both instances of searchd tried to write to the same binary logs. The instance started last was of course not permitted to manipulate these files, and thus exited with a fatal error:
[Fri Nov 2 17:13:32.262 2012] [ 5346] FATAL: failed to lock
'/etc/sphinx/var/data/binlog.lock': 11 'Resource temporarily unavailable'
[Fri Nov 2 17:13:32.264 2012] [ 5345] Child process 5346 has been finished,
exit code 1. Watchdog finishes also. Good bye!
The solution was simple: specify a binlog_path (a different one for each instance) inside the searchd section of each configuration file:
searchd
{
[...]
binlog_path = /path/to/writable/directory
[...]
}
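For instance (the paths are only examples), each config file can point at its own directory:
mkdir -p /var/sphinx/binlog-main /var/sphinx/binlog-staging
with binlog_path = /var/sphinx/binlog-main in sphinx.conf and binlog_path = /var/sphinx/binlog-staging in sphinx.staging.conf.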