How to resolve "Invalid Sequence Token" when using cloudwatch agent? - amazon-cloudwatchlogs

I'm seeing the following warning in the /var/log/amazon/amazon-cloudwatch-agent/amazon-cloudwatch-agent.log:
2021-10-06T06:39:23Z W! [outputs.cloudwatchlogs] Invalid SequenceToken used, will use new token and retry: The given sequenceToken is invalid. The next expected sequenceToken is: 49619410836690261519535138406911035003981074860446093650
But there is no mention of which file is actually the one failing, not even when I add "debug": true to /opt/aws/amazon-cloudwatch-agent/bin/config.json:
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq .agent
{
"metrics_collection_interval": 60,
"debug": true,
"run_as_user": "root"
}
I have many (28) files in the .logs.logs_collected.files.collect_list section of my config.json, so how can I find exactly which file is causing trouble?

As of 2021-11-29, a PR to improve the log messages has been merged into the cloudwatch-agent, but a new version of the agent has not been released yet; the next version after v1.247349.0 will likely include this improvement.
The fix changes the log statements to:
INFO: First time sending logs to %v/%v since startup so sequenceToken is nil, learned new token: xxxx: yyyy
This is an INFO message, as this behaviour is expected, for example at startup.
WARN: Invalid SequenceToken used (%v) while sending logs to %v/%v, will use new token and retry: xxxx
This, on the other hand, is not expected and may mean that someone else is writing to the log group/log stream concurrently.
If those warnings appear right after a restart of the CloudWatch agent (cwagent), you can safely ignore them; it's expected behaviour. The agent does not save the next sequence token in its persistent state, so on restart it "learns" the correct sequence number by issuing a PutLogEvents call with no sequence token at all; that call returns an InvalidSequenceTokenException carrying the next sequence token to use. Seeing those messages at startup is therefore expected, which is why I proposed a PR to amazon-cloudwatch-agent to improve them.
If "Invalid SequenceToken used" is still seen long after a restart, then you may have other issues.
The "Invalid SequenceToken used" error usually means that two entities/sources are trying to write to the same log group/log stream as mentioned in 2 (which is really for the old awslogs agent but still useful):
Caught exception: An error occurred (InvalidSequenceTokenException) when calling the PutLogEvents operation: The given sequenceToken is invalid[…] -or- Multiple agents might be sending log events to log stream[…] – You can't push logs from multiple log files to a single log stream. Update your configuration to push each log to a log stream-log group combination.
It could be that the Amazon CloudWatch agent itself is trying to upload the same file twice because you have duplicates in your config.json.
So first print all your log group / log stream pairs in your config.json with:
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq -r '.logs.logs_collected.files.collect_list[]|"\(.log_group_name) \(.log_stream_name)"'|sort
which should give an output similar to:
/tableauserver/apigateway apigateway_node5-0.log
/tableauserver/apigateway control_apigateway_node5-0.log
/tableauserver/appzookeeper appzookeeper-discovery_node5-1.log
...
/tableauserver/vizqlserver vizqlserver_node5-3.log
Then you can use uniq -d to find the duplicates in that list with:
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq -r '.logs.logs_collected.files.collect_list[]|"\(.log_group_name) \(.log_stream_name)"'|sort|uniq -d
# The list should be empty otherwise you have duplicates
If that command produces any output it means that you have duplicates in your config.json collect_list.
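To see which files map to a duplicated pair, you can also print the file_path next to each log group/log stream (a sketch, assuming the standard collect_list keys):
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq -r '.logs.logs_collected.files.collect_list[]|"\(.log_group_name) \(.log_stream_name) \(.file_path)"'|sort
# Any log group / log stream pair that shows up with two different file_path
# values is two files being pushed to the same stream.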
I personally think that cwagent itself should print the "offending" log group/log stream in its logs, so I opened an issue on the amazon-cloudwatch-agent GitHub page.

Related

Kubernetes DSE Cassandra CommitLogReplayer$CommitLogReplayException

I have installed Cassandra on Kubernetes (9 pods). All the pods are up and running except for one pod, which shows the below error.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Encountered bad header at position 47137 of commit log /var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log, with bad position but valid CRC
at org.apache.cassandra.db.commitlog.CommitLogReplayer.shouldSkipSegmentOnError(CommitLogReplayer.java:438)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleUnrecoverableError(CommitLogReplayer.java:452)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:109)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:84)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:236)
at org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:134)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:154)
at org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:213)
at org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:194)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:338)
at com.datastax.bdp.server.DseDaemon.setup(DseDaemon.java:527)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:702)
at com.datastax.bdp.DseModule.main(DseModule.java:96)
Caused by: org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Encountered bad header at position 47137 of commit log /var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log, with bad position but valid CRC
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:111)
... 12 more
ERROR [main] 2021-09-06 06:19:08,990 JVMStabilityInspector.java:251 - JVM state determined to be unstable. Exiting forcefully due to:
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Encountered bad header at position 47137 of commit log /var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log, with bad position but valid CRC
at org.apache.cassandra.db.commitlog.CommitLogReplayer.shouldSkipSegmentOnError(CommitLogReplayer.java:438)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleUnrecoverableError(CommitLogReplayer.java:452)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:109)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:84)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:236)
at org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:134)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:154)
at org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:213)
at org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:194)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:338)
at com.datastax.bdp.server.DseDaemon.setup(DseDaemon.java:527)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:702)
at com.datastax.bdp.DseModule.main(DseModule.java:96)
Can someone help me out, please?
For whatever reason, one of the commit log segments got corrupted on the node.
You can work around the issue by manually deleting this file on the pod:
/var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log
Interestingly, that commit log segment was created on September 2 (1630582314923) but the log entry you posted was from September 6. This indicates something happened to the pod which resulted in the corrupted file.
You'll need to review the Cassandra logs on the pod (not the pod logs themselves) to determine the root cause and address it. Cheers!
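A minimal sketch of that workaround, assuming the affected pod is named cassandra-0 and the container stays up long enough to exec into (otherwise you may need to access the persistent volume directly):
# Remove the corrupt segment named in the error, then restart the pod
kubectl exec cassandra-0 -- rm /var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log
kubectl delete pod cassandra-0
# Review Cassandra's own logs (path may differ in your image)
kubectl exec cassandra-0 -- tail -n 200 /var/log/cassandra/system.log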

Cloudwatch logs "AND NOT" search

I'm searching Cloudwatch log events for errors with the following criteria:
?"error" ?"ERROR" ?"Error:"
How can I exclude specific terms from the result? For example, if I don't care about specific_error, how can I specify not to match on it?
I'm expecting to be able to do something like:
(?"error" AND -"specific_error") ?"ERROR" ?"Error:"
In the CloudWatch console, this can be accomplished with the - operator before the term you wish to exclude:
"error" -"something minor happened"
This is from the AWS docs for "Matching terms in log events".
Similarly, using aws logs tail, you can pass this to the --filter-pattern argument:
$ aws logs tail --format short /aws/lambda/my_lambda --filter-pattern '"error" -"something minor happened"' --since 3h
2021-07-09T19:28:47 error: something bad happened
2021-07-09T19:28:51 error: something bad happened
2021-07-09T19:29:52 error: something REALLY bad happened
2021-07-09T19:30:15 error: something CATASTROPHIC happened! Aiee!
2021-07-09T19:30:36 error: something bad happened
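The same pattern works with aws logs filter-log-events as well, if you'd rather query than tail (a sketch; the log group name is a placeholder):
$ aws logs filter-log-events \
    --log-group-name /aws/lambda/my_lambda \
    --filter-pattern '"error" -"something minor happened"' \
    --query 'events[].message' \
    --output text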

IBM BLUEMIX BLOCKCHAIN SDK-DEMO failing

I have been working with HFC SDK for Node.js and it used to work, but since last night I am having some problems.
When running helloblockchain.js, it only works a few times; most of the time I get this error when it tries to enroll a new user:
E0113 11:56:05.983919636 5288 handshake.c:128] Security handshake failed: {"created":"#1484304965.983872199","description":"Handshake read failed","file":"../src/core/lib/security/transport/handshake.c","file_line":237,"referenced_errors":[{"created":"#1484304965.983866102","description":"FD shutdown","file":"../src/core/lib/iomgr/ev_epoll_linux.c","file_line":948}]}
Error: Failed to register and enroll JohnDoe: Error
Other times, the enroll works and the failure appears deploying the chaincode:
Enrolled and registered JohnDoe successfully
Deploying chaincode ...
E0113 12:14:27.341527043 5455 handshake.c:128] Security handshake failed: {"created":"#1484306067.341430168","description":"Handshake read failed","file":"../src/core/lib/security/transport/handshake.c","file_line":237,"referenced_errors":[{"created":"#1484306067.341421859","description":"FD shutdown","file":"../src/core/lib/iomgr/ev_epoll_linux.c","file_line":948}]}
Failed to deploy chaincode: request={"fcn":"init","args":["a","100","b","200"],"chaincodePath":"chaincode","certificatePath":"/certs/peer/cert.pem"}, error={"error":{"code":14,"metadata":{"_internal_repr":{}}},"msg":"Error"}
Or:
Enrolled and registered JohnDoe successfully
Deploying chaincode ...
E0113 12:15:27.448867739 5483 handshake.c:128] Security handshake failed: {"created":"#1484306127.448692244","description":"Handshake read failed","file":"../src/core/lib/security/transport/handshake.c","file_line":237,"referenced_errors":[{"created":"#1484306127.448668047","description":"FD shutdown","file":"../src/core/lib/iomgr/ev_epoll_linux.c","file_line":948}]}
events.js:160
throw er; // Unhandled 'error' event
^
Error
at ClientDuplexStream._emitStatusIfDone (/usr/lib/node_modules/hfc/node_modules/grpc/src/node/src/client.js:189:19)
at ClientDuplexStream._readsDone (/usr/lib/node_modules/hfc/node_modules/grpc/src/node/src/client.js:158:8)
at readCallback (/usr/lib/node_modules/hfc/node_modules/grpc/src/node/src/client.js:217:12)
E0113 12:15:27.563487641 5483 handshake.c:128] Security handshake failed: {"created":"#1484306127.563437122","description":"Handshake read failed","file":"../src/core/lib/security/transport/handshake.c","file_line":237,"referenced_errors":[{"created":"#1484306127.563429661","description":"FD shutdown","file":"../src/core/lib/iomgr/ev_epoll_linux.c","file_line":948}]}
This code worked yesterday, so I don't know what could be happening.
Does anybody know how I can fix it?
Thanks,
Javier.
ibm-bluemix
blockchain
These types of intermittent issues are usually related to GRPC. An initial suggestion is to ensure that you are using at least GRPC version 1.0.0.
If you are using a Mac, then the maximum number of open file descriptors should be checked (using ulimit -n). Sometimes this is initially set to a low value such as 256, so increasing the value could help.
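For example, you might check both like this (a sketch; adjust the limit to whatever your environment allows):
# Which grpc version did the HFC SDK pull in?
npm ls grpc
# Current open-file-descriptor limit, then raise it for this shell
ulimit -n
ulimit -n 4096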
There are a couple of GRPC issues with similar symptoms.
https://github.com/grpc/grpc/issues/8732
https://github.com/grpc/grpc/issues/8839
https://github.com/grpc/grpc/issues/8382
There is a grpc.initial_reconnect_backoff_ms property that is mentioned in some of these issues. Increasing the value past the 1000 ms level might help reduce the frequency of issues. Below are instructions for how the helloblockchain.js file can be modified to set this property to a higher value.
Open the helloblockchain.js file in the Hyperledger Fabric Client example and find the enrollAndRegisterUsers function.
Add "grpc.initial_reconnect_backoff_ms": 5000 to the setMemberServicesUrl call:
chain.setMemberServicesUrl(ca_url, {
    pem: cert,
    "grpc.initial_reconnect_backoff_ms": 5000
});
Add "grpc.initial_reconnect_backoff_ms": 5000 to the addPeer call:
chain.addPeer("grpcs://" + peers[i].discovery_host + ":" + peers[i].discovery_port,
{pem: cert, "grpc.initial_reconnect_backoff_ms": 5000
});
Note that setting the grpc.initial_reconnect_backoff_ms property may reduce the frequency of issues, but it will not necessarily eliminate all issues.
The connection to the eventhub that is made in helloblockchain.js can also be a factor. There is an earlier version of the Hyperledger Fabric Client example that does not use the eventhub, and trying it can help determine whether that connection makes a difference. After running git clone https://github.com/IBM-Blockchain/SDK-Demo.git, run git checkout b7d5195 to use that prior level. Before running node helloblockchain.js from a Node.js command window, you can use git status to check which code level is in use.
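Putting those steps together:
git clone https://github.com/IBM-Blockchain/SDK-Demo.git
cd SDK-Demo
git checkout b7d5195   # earlier level that does not use the eventhub
git status             # confirm which code level you are on
node helloblockchain.js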

Error: No chunks found for a file with Mongo gridFS

An error has crashed my application server and I can't seem to figure out what could be causing the issue. My application is built with Meteor and hosted on modulus.io. Here are my application logs:
Error: no chunks found for file, possibly corrupt
at /mnt/data/2/node_modules/mongodb/lib/mongodb/gridfs/gridstore.js:817:20
at /mnt/data/2/node_modules/mongodb/lib/mongodb/gridfs/gridstore.js:594:7
at /mnt/data/2/node_modules/mongodb/lib/mongodb/cursor.js:758:35
at Cursor.close (/mnt/data/2/node_modules/mongodb/lib/mongodb/cursor.js:989:5)
at Cursor.nextObject (/mnt/data/2/node_modules/mongodb/lib/mongodb/cursor.js:758:17)
at commandHandler (/mnt/data/2/node_modules/mongodb/lib/mongodb/cursor.js:727:14)
at /mnt/data/2/node_modules/mongodb/lib/mongodb/db.js:1916:9
at Server.Base._callHandler (/mnt/data/2/node_modules/mongodb/lib/mongodb/connection/base.js:448:41)
at /mnt/data/2/node_modules/mongodb/lib/mongodb/connection/server.js:481:18
at [object Object].MongoReply.parseBody (/mnt/data/2/node_modules/mongodb/lib/mongodb/responses/mongo_reply.js:68:5)
[2015-03-29T22:05:57.573Z] Application CRASH detected. Exit code 8.
Most probably this is a MongoDB bug with GridFS (it has since been fixed):
Writing two or more different files concurrently from different node processes using the GridStore.writeFile command results in some files not being correctly written (ending up with a number of corrupt files in the gridstore). You end up with corrupt files even with all writeFile calls being successful and no indication of error. writeFile occasionally fails with error "chunks out of order", but this happens very rarely (something like 1 failed writeFile for 100 corrupt files or more).
Based on the comments in that discussion, the problem should be fixed if you update MongoDB (the corrupt GridFS files should be removed).
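If you need to locate and remove the corrupt entries, the mongofiles tool that ships with MongoDB can list and delete GridFS files by name (a sketch; the database name and filename are placeholders):
# List all GridFS files in the database
mongofiles --db mydb list
# Delete a corrupt file by its filename
mongofiles --db mydb delete path/to/corrupt-file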
Error: no chunks found for file, possibly corrupt
at /home/developer/rundir/node_modules/mongoose/node_modules/mongodb/lib/mongodb/gridfs/gridstore.js:808:20
at /home/developer/rundir/node_modules/mongoose/node_modules/mongodb/lib/mongodb/gridfs/gridstore.js:586:5
at /home/developer/rundir/node_modules/mongoose/node_modules/mongodb/lib/mongodb/collection/query.js:164:5
at /home/developer/rundir/node_modules/mongoose/node_modules/mongodb/lib/mongodb/cursor.js:778:35
I had a similar occurrence, but it turned out the file sought by a GridFS read stream had actually been deleted - so in my case it wasn't corrupt, but gone! Above is a log from when that happened.

NFS mount points are going off/NFS compound failed for server mashost

We have an application on Solaris. During a specific test case we generate a heap dump, which is written to the server at a specific path. During this test we get the following error in the trace file:
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /ossrc/upgrade/JREheapdumps/java_pid16092.hprof ...
Dump file is incomplete: I/O error
and in /var/adm/messages we could see
Oct 28 13:00:10 ossuas2 nfs: [ID 733954 kern.info] NOTICE: [NFS4][Server: mashost][Mntpt: /ossrc/upgrade]NFS server mashost not
responding; still trying
Oct 28 13:02:53 ossuas2 nfs: [ID 733954 kern.info] NOTICE: [NFS4][Server: mashost][Mntpt: /usr/local]NFS server mashost not
responding; still trying
Oct 28 13:04:53 ossuas2 nfs: [ID 733954 kern.info] NOTICE: [NFS4][Server: mashost][Mntpt: /etc/opt/ericsson]NFS server mashost not
responding; still trying
Can anyone please help us understand why we are getting this problem, and can anyone tell us whether an application can cause this impact on mashost?
First things first, check the NFS services with svcs. When the problem occurs, run:
# svcs -x nfs/client
on the client, and
# svcs -x nfs/server
on the server. I would expect one or both to be in the "maintenance" state (you may also see that the service fails to start at all). If it is in maintenance state, you should see a row marked "Reason:" that says why.
You might see "offline" -- in that case, startd will attempt to reboot the service multiple times and, if it fails after five attempts or hangs indefinitely, places it into "maintenance" state and stops restarting.
Check the logs in
/var/svc/log/<service-name FMRI>.log
There will be one on your client machine under "network-nfs-client:default" (the instance may have a name other than 'default' if it has been changed manually), and one on the server under "network-nfs-server:default".
See what you can glean from those.
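For example, to read the most recent entries from those logs (the exact file names may differ on your system), on the client:
# tail -n 100 /var/svc/log/network-nfs-client:default.log
and on the server:
# tail -n 100 /var/svc/log/network-nfs-server:default.log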
SMF automatically takes snapshots of service configurations as backups, so you can try reverting to one of those with svccfg:
# svccfg -s nfs/server:default
svc:/network/nfs/server:default> listsnap
svc:/network/nfs/server:default> revert start [name_of_snapshot]
svc:/network/nfs/server:default> quit
# svcadm refresh nfs/server:default
# svcadm restart nfs/server:default
Make sure to include the ":default" tag, or if you saw a different tag from "svcs nfs/server", include that one instead; the tag names an instance of the service, and every running service is an instance.
If the service is failing to start, you might have to look at the XML manifest under /lib/svc/manifest/network/nfs/ -- inside, you'll see dependencies (and services dependent on this one), then "exec_method" entries, which define how the service starts, stops, and restarts.
Instead of snapshots, you can also restore the service to its defaults: use svccfg -s <FMRI> delete to clear it, then svcadm refresh <FMRI> and svcadm enable <FMRI>.
If the service is in maintenance state, once you've isolated and fixed the problem, you can manually clear that state by running svcadm clear <FMRI>.
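For example, to clear the maintenance state on the server side and check that it comes back online:
# svcadm clear nfs/server:default
# svcs -x nfs/server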