I've just spent a couple of hours trying to detect resets on my system from the console log, where I have:
Boot up message // 1st console output on a boot
Shutdown message // Last console output on a CLEAN shutdown
Grepping the console output for these two lines gives me text like this:
Boot up message
Shutdown message
Boot up message
Shutdown message
Boot up message
Boot up message
Shutdown message
A reset shows up as 2 consecutive Boot messages. I have a few thousand cycles to go through, so I want to use the '-n' switch with grep to print the line numbers, giving something like:
1:Boot up message
2-Shutdown message
3:Boot up message
4-Shutdown message
5:Boot up message
6:Boot up message // reset occurred here
7-Shutdown message
How can I use sed/grep (in Cygwin) to find only the consecutive Boot messages?
You can use this sed:
sed -n 'N;/^\(Boot.*\)\n\1/=' file
(OR)
sed -n 'N;/^\(Boot.*\)\n\1/p' file
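The first command (=) prints the line number of the second of the two consecutive Boot messages; the second (p) prints the matching pair itself. Note that N joins lines in non-overlapping pairs (1-2, 3-4, ...), so a repeat is only caught when its first Boot line falls on an odd-numbered line. A sliding-window variant that checks every adjacent pair is sed -n '$!N;/^\(Boot.*\)\n\1$/=;D' file.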
Test:
$ cat file
Boot up message
Shutdown message
Boot up message
Shutdown message
Boot up message
Boot up message
Shutdown message
$ sed -n 'N;/^\(Boot.*\)\n\1/p' file
Boot up message
Boot up message
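For comparison, the same check can be sketched in Python. This is only an illustration of the "two Boot lines in a row" rule described above, not part of the original answer; the function name find_resets and the boot_marker default are assumptions chosen for this example.

```python
def find_resets(lines, boot_marker="Boot up message"):
    """Return the 1-based line numbers of every Boot line that
    directly follows another Boot line (i.e. no clean shutdown between)."""
    resets = []
    previous_was_boot = False
    for number, line in enumerate(lines, start=1):
        is_boot = line.startswith(boot_marker)
        if is_boot and previous_was_boot:
            resets.append(number)
        previous_was_boot = is_boot
    return resets


# Same content as the test file used above.
sample = [
    "Boot up message",
    "Shutdown message",
    "Boot up message",
    "Shutdown message",
    "Boot up message",
    "Boot up message",
    "Shutdown message",
]

print(find_resets(sample))
```

Because every line is compared with the one before it, the result does not depend on whether the repeated Boot message starts on an odd or an even line.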
I'm using the python-O365 library in order to access my O365 mailbox. The project requires me to execute the program in a docker container. If I start the program manually (as root), everything works fine, but if I try to start it via cron, it stalls on DEBUG Starting new HTTPS connection (1): login.microsoftonline.com:443, which I found out after activating logging.
The minimal code example that reproduces the error (with log):
import O365
from utils.credentials import get_credentials
import logging # We want to get additional information
logging.basicConfig(
filename='./easy_log_file.log',
filemode='a',
format='%(levelname)s %(message)s', # %(asctime)s %(pathname)s %(lineno)d
level=logging.DEBUG
)
filename = "o365_token.txt"
token_backend = O365.FileSystemTokenBackend(token_path = filename)
account = O365.Account(get_credentials(), token_backend=token_backend)
inbox = account.mailbox().inbox_folder()
messages = inbox.get_messages()
for message in messages:
    logging.info(message)
logging.info("finished")
To start it via cron, I used the following command:
echo "15 21 * * * bash /workspace/daemon_start.sh >> /workspace/cronlogs/logs_daemon_mail.log" | crontab
If I start the program manually, the log continues like this:
DEBUG Starting new HTTPS connection (1): graph.microsoft.com:443
DEBUG https://graph.microsoft.com:443 "GET /v1.0/me/mailFolders/Inbox/messages?%24top=100&%24filter=isRead+eq+false HTTP/1.1" 200 None
DEBUG Received response (200) from URL https://graph.microsoft.com/v1.0/me/mailFolders/Inbox/messages?%24top=100&%24filter=isRead+eq+false
If the program is started via cron, sometimes the log continues like this:
DEBUG Incremented Retry for (url='/common/oauth2/v2.0/token'): Retry(total=2, connect=3, read=3, redirect=None, status=None)
WARNING Retrying (Retry(total=2, connect=3, read=3, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)'))': /common/oauth2/v2.0/token
To resolve the issue, I added my proxy with account = Account(credentials, token_backend=token_backend, proxy_server="proxy.my_proxy.com"). It is strange that I have to add it, since the container is already configured to use this proxy. With this setting I ran into the same issue, except that under cron the log always continued as shown above, and much faster.
Since I assume that cron simply starts the program and does not interfere with its connections, it makes no sense to me that I get different results depending on whether I start it manually or via cron.
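One difference worth ruling out: cron starts jobs with a minimal environment and does not read the shell profile, so proxy variables that exist in an interactive root shell may be missing under cron. HTTP libraries such as requests typically pick the proxy up from variables like HTTPS_PROXY. The sketch below is a diagnostic under that assumption, not a confirmed cause; the helper names are made up for this example. Calling it at the top of the script in both run modes makes the two logs comparable. Be aware that a proxy URL may contain credentials.

```python
import logging
import os

# Variable names commonly consulted for proxy settings.
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY",
              "http_proxy", "https_proxy", "no_proxy")


def proxy_environment(environ=None):
    """Return a dict of proxy-related variables (None when unset)."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in PROXY_VARS}


def log_proxy_environment(environ=None):
    """Write the proxy-related variables to the log and return them."""
    settings = proxy_environment(environ)
    for name, value in settings.items():
        logging.info("%s=%s", name, "<unset>" if value is None else value)
    return settings


log_proxy_environment()
```

If the variables are set in the manual run but unset under cron, exporting them inside daemon_start.sh would be one way to test whether this explains the difference.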
I'm seeing the following warning in the /var/log/amazon/amazon-cloudwatch-agent/amazon-cloudwatch-agent.log:
2021-10-06T06:39:23Z W! [outputs.cloudwatchlogs] Invalid SequenceToken used, will use new token and retry: The given sequenceToken is invalid. The next expected sequenceToken is: 49619410836690261519535138406911035003981074860446093650
But the message does not say which file is actually failing, not even when I add "debug": true to the /opt/aws/amazon-cloudwatch-agent/bin/config.json.
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq .agent
{
"metrics_collection_interval": 60,
"debug": true,
"run_as_user": "root"
}
I have many (28) files in the .logs.logs_collected.files.collect_list section of my config.json file, so how can I find out exactly which file is causing trouble?
As of 2021-11-29, a PR to improve the log messages has been merged into the cloudwatch-agent, but a new version has not been released yet. The next version after v1.247349.0 will likely include the fix.
The fix changes the log statements to:
INFO: First time sending logs to %v/%v since startup so sequenceToken is nil, learned new token: xxxx: yyyy
This is an INFO message, because this behaviour is expected at startup, for example.
WARN: Invalid SequenceToken used (%v) while sending logs to %v/%v, will use new token and retry: xxxxxv
This one, on the other hand, is not expected and may mean that someone else is writing to the log group/log stream concurrently.
If those warnings come right after a restart of the CloudWatch agent (cwagent), you can safely ignore them; it is expected behaviour. The agent does not save the next sequence token in its persistent state, so on restart it "learns" the correct sequence token by issuing a PutLogEvents call with no sequence token at all, which returns an InvalidSequenceTokenException containing the next sequence token to use. So it is expected to see those warnings at startup. I proposed a PR to amazon-cloudwatch-agent to improve those log messages.
If "Invalid SequenceToken used" is seen long after the restart, then you may have other issues.
The "Invalid SequenceToken used" error usually means that two entities/sources are trying to write to the same log group/log stream, as the following quote explains (it was written for the old awslogs agent but is still useful):
Caught exception: An error occurred (InvalidSequenceTokenException)
when calling the PutLogEvents operation: The given sequenceToken is
invalid[…] -or- Multiple agents might be sending log events to log
stream[…] – You can't push logs from multiple log files to a single
log stream. Update your configuration to push each log to a log
stream-log group combination.
It could be that the CloudWatch agent itself is trying to upload to the same log stream twice because you have duplicates in your config.json.
So first print all your log group / log stream pairs in your config.json with:
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq -r '.logs.logs_collected.files.collect_list[]|"\(.log_group_name) \(.log_stream_name)"'|sort
which should give an output similar to:
/tableauserver/apigateway apigateway_node5-0.log
/tableauserver/apigateway control_apigateway_node5-0.log
/tableauserver/appzookeeper appzookeeper-discovery_node5-1.log
...
/tableauserver/vizqlserver vizqlserver_node5-3.log
Then you can use uniq -d to find the duplicates in that list with:
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq -r '.logs.logs_collected.files.collect_list[]|"\(.log_group_name) \(.log_stream_name)"'|sort|uniq -d
# The list should be empty otherwise you have duplicates
If that command produces any output it means that you have duplicates in your config.json collect_list.
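The same duplicate check can be sketched in Python, which also makes it easy to show which files share a log group/log stream pair. This is a minimal sketch assuming the usual collect_list entry keys (file_path, log_group_name, log_stream_name); the function name and the sample file paths are made up for illustration.

```python
from collections import defaultdict


def find_duplicate_streams(config):
    """Map each (log group, log stream) pair that is used by more than
    one collect_list entry to the list of file paths that share it."""
    files_by_stream = defaultdict(list)
    entries = config["logs"]["logs_collected"]["files"]["collect_list"]
    for entry in entries:
        key = (entry.get("log_group_name", "<missing>"),
               entry.get("log_stream_name", "<missing>"))
        files_by_stream[key].append(entry.get("file_path", "<missing>"))
    return {key: paths for key, paths in files_by_stream.items()
            if len(paths) > 1}


# Hypothetical config with one duplicated pair; a real config would be
# loaded with json.load() from the agent's config.json.
sample_config = {"logs": {"logs_collected": {"files": {"collect_list": [
    {"file_path": "/logs/a.log", "log_group_name": "/app/api",
     "log_stream_name": "api_node1.log"},
    {"file_path": "/logs/b.log", "log_group_name": "/app/api",
     "log_stream_name": "api_node1.log"},
    {"file_path": "/logs/c.log", "log_group_name": "/app/api",
     "log_stream_name": "control_node1.log"},
]}}}}

for (group, stream), paths in find_duplicate_streams(sample_config).items():
    print(group, stream, paths)
```

An empty result corresponds to the empty output of the uniq -d pipeline above.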
I personally think that cwagent itself should print the "offending" log group/log stream in the logs, so I opened an issue on the amazon-cloudwatch-agent GitHub page.
I use Filebeat with ELK. I started it with the nohup command:
nohup ./filebeat -e -c filebeat.yml -d "publish" > filebeat.log &
The application stopped on its own after one day. The close_inactive parameter does not work. Is there any configuration that I missed for this problem?
2020-10-22T09:55:36.814+0100 INFO crawler/crawler.go:165 Crawler stopped
2020-10-22T09:55:36.815+0100 INFO registrar/registrar.go:367 Stopping Registrar
2020-10-22T09:55:36.815+0100 INFO registrar/registrar.go:293 Ending Registrar
2020-10-22T09:55:36.820+0100 INFO [monitoring] log/log.go:153 Total non-zero metrics {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":10540,"time":{"ms":10547}},"total":{"ticks":68190,"time":{"ms":68203},"value":68190},"user":{"ticks":57650,"time":{"ms":57656}}},"handles":{"limit":{"hard":16000,"soft":16000},"open":10},"info":{"ephemeral_id":"b57f1c4d-7a80-4f1f-aaba-5ab9ee057757","uptime":{"ms":7119571}},"memstats":{"gc_next":22377264,"memory_alloc":11462592,"memory_total":18240359416,"rss":50831360},"runtime":{"goroutines":21}},"filebeat":{"events":{"added":528063,"done":528063},"harvester":{"closed":77,"open_files":0,"running":0,"started":77},"input":{"log":{"files":{"truncated":38}}}},"libbeat":{"config":{"module":{"running":0},"reloads":1},"output":{"events":{"acked":527884,"batches":4732,"failed":51426,"total":579310},"read":{"bytes":32364,"errors":4},"type":"logstash","write":{"bytes":180629879,"errors":19}},"pipeline":{"clients":0,"events":{"active":0,"filtered":179,"published":527884,"retry":99719,"total":528063},"queue":{"acked":527884}}},"registrar":{"states":{"cleanup":8,"current":38,"update":528063},"writes":{"success":4356,"total":4356}},"system":{"cpu":{"cores":8},"load":{"1":0.66,"15":0.52,"5":0.56,"norm":{"1":0.0825,"15":0.065,"5":0.07}}}}}}
2020-10-22T09:55:36.820+0100 INFO [monitoring] log/log.go:154 Uptime: 1h58m39.572210325s
2020-10-22T09:55:36.820+0100 INFO [monitoring] log/log.go:131 Stopping metrics logging.
2020-10-22T09:55:36.820+0100 INFO instance/beat.go:432 filebeat stopped.
What is the content of "filebeat.yml"? Filebeat can stop, for example, if you didn't define any paths.
Also, you might want to change the logging level to get more information as to what happened:
logging.level: debug
Stop the Filebeat service and run Filebeat in debug mode from the command line, in the Filebeat home directory, to check for any issue in your configuration:
filebeat -e -c filebeat.yml -d "*"
I know that the sacctmgr command can list the event history of nodes along with the reason:
sacctmgr show event Start=09/01-00:00 format=nodename,timestart,timeend,state,reason,user
This command gives the following output
gnodeXX 2020-09-04T20:21:34 2020-09-05T01:21:38 DRAIN Kill task failed root(ZZ)
gnodeXX 2020-09-09T16:44:55 2020-09-09T17:50:21 DOWN* Not responding slurm(DDDD)
Is there any way to get the job ID or username that caused the node failure, or any information on the task for which the kill failed? The user column shows only one of two values for all results, root(ZZ) or slurm(DDDD), and I am not sure what these imply.
I wanted to send data to a socket so that my Flink program can read it from there.
Following this guidance https://stackoverflow.com/a/53943644/6640504 I did that with:
cqlsh -e "select * from tableName;" -k mykeyspace Ipaddress 9042 |
nc -lk portNumber
and my Flink program read the data from the socket without any errors.
Now I have a new issue: Cassandra sends the data to the socket page by page, and the Flink program receives only one page. I don't know how to send the following pages automatically.
I disabled paging in Cassandra with "paging off", but that gave me this error:
ReadFailure: Error from server: code=1300 [Replica(s) failed to
execute read] message="Operation failed - received 0 responses and 1
failures" info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
Could you please help me send the whole result of the query "select * from tableName;" to the socket so that the Flink program can use it?
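One possible direction, sketched here under assumptions rather than offered as a confirmed fix, is to replace the cqlsh pipe with a small script that uses the DataStax Python driver (cassandra-driver). With paging left on, iterating over the driver's result set fetches the following pages transparently, so every row can be written to the socket. The function names, the comma-separated line format, and the fetch_size of 1000 are choices made for this example; a full table scan may still hit server-side timeouts.

```python
import socket


def format_row(row, sep=","):
    """Render one result row as a single text line."""
    return sep.join(str(value) for value in row) + "\n"


def stream_rows(rows, conn):
    """Send every row to the connected socket; return the row count."""
    count = 0
    for row in rows:
        conn.sendall(format_row(row).encode("utf-8"))
        count += 1
    return count


def serve_query(query, keyspace, host, cassandra_port, listen_port,
                fetch_size=1000):
    """Wait for one client (like nc -l) and stream the query result to it."""
    # Imported here so the helpers above work without the driver installed.
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster([host], port=cassandra_port)
    try:
        session = cluster.connect(keyspace)
        statement = SimpleStatement(query, fetch_size=fetch_size)
        with socket.create_server(("", listen_port)) as server:
            conn, _ = server.accept()
            with conn:
                # Iterating the result set pulls in further pages as needed.
                return stream_rows(session.execute(statement), conn)
    finally:
        cluster.shutdown()


# Example call (placeholders as in the question):
# serve_query("select * from tableName;", "mykeyspace", "Ipaddress", 9042, portNumber)
```

Keeping stream_rows separate from the driver code means the socket side can be tried with any iterable of rows before pointing it at the real table.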