Spurious dtrace user stack traces? - solaris

I am using dtrace to sample stack traces of a process like this:
# /usr/sbin/dtrace -p $MYPID -x ustackframes=100 -s stack.d -o stack.log
Where stack.d looks like:
profile-99
/execname == "mybinary"/
{
@[execname, ustack()] = count();
}
I am passing -p to dtrace so that it still has the symbols when writing out the stacks, even if the user process has already exited.
This works - but I observe 3 issues:
During tracing I regularly get error messages like:
dtrace: error on enabled probe ID 1 (ID 12345: profile:::profile-99):\
invalid address (0x0) in action #3
dtrace: error on enabled probe ID 1 (ID 12345: profile:::profile-99):\
invalid address (0xffffffff00000000) in action #3
dtrace: error on enabled probe ID 1 (ID 12345: profile:::profile-99):\
invalid address (0x7ffe8000) in action #3
[..]
(i.e. many of those)
What is the underlying issue there?
The process executes like it should (i.e. no crashes).
The other thing is that a lot of recorded stacks don't contain any symbols at all.
What are possible reasons for such symbol-less stacks?
Also - some stacks do not start with _start -> main but still contain symbols (i.e. functions from the binary).
Are those incomplete stacks, i.e. ones where dtrace somehow failed to walk all the way to the bottom?
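For reference, one way to see which fault types those error messages correspond to is to append an ERROR clause to stack.d. This is only a sketch (not part of the original script), assuming the standard DTrace ERROR probe arguments, where arg4 is the fault type and arg5 the fault-specific value such as the offending address:
/* counts faults raised while recording stacks; printed with the other aggregations on exit */
dtrace:::ERROR
{
        @faults[arg4, arg5] = count();
}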

Related

How to resolve "Invalid Sequence Token" when using cloudwatch agent?

I'm seeing the following warning in the /var/log/amazon/amazon-cloudwatch-agent/amazon-cloudwatch-agent.log:
2021-10-06T06:39:23Z W! [outputs.cloudwatchlogs] Invalid SequenceToken used, will use new token and retry: The given sequenceToken is invalid. The next expected sequenceToken is: 49619410836690261519535138406911035003981074860446093650
But there is no mention of which file is actually the one that is failing, not even when I add "debug": true to /opt/aws/amazon-cloudwatch-agent/bin/config.json.
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq .agent
{
"metrics_collection_interval": 60,
"debug": true,
"run_as_user": "root"
}
I have many (28) files in the .logs.logs_collected.files.collect_list section of the config.json file, so how can I find which file exactly is causing the trouble?
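For reference, the number of collect_list entries can be confirmed with a jq one-liner (just a sketch using the same config path as above):
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json | jq '.logs.logs_collected.files.collect_list | length'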
As of 2021-11-29, a PR to improve the log messages has been merged into the cloudwatch-agent, but a new version of the agent has not been released yet; the next version after v1.247349.0 will likely include the fix.
The fix will change the log statements to say:
INFO: First time sending logs to %v/%v since startup so sequenceToken is nil, learned new token: xxxx: yyyy. This is an INFO message, as this behaviour is expected at startup, for example.
WARN: Invalid SequenceToken used (%v) while sending logs to %v/%v, will use new token and retry: xxxxx. This, on the other hand, is not expected and may mean that someone else is writing to the log group/log stream concurrently.
If those warnings come right after a restart of the CloudWatch agent (cwagent), you can safely ignore them; it is expected behaviour. The agent does not save the next sequence token in its persistent state, so on restart it "learns" the correct sequence number by issuing a PutLogEvents call with no sequence token at all; that call returns an InvalidSequenceTokenException containing the next sequence token to use. So it is expected to see those messages at startup. In any case, I proposed a PR to amazon-cloudwatch-agent to improve those log messages.
If the "Invalid SequenceToken used" is seen long after the restart then you may have other issues.
The "Invalid SequenceToken used" error usually means that two entities/sources are trying to write to the same log group/log stream as mentioned in 2 (which is really for the old awslogs agent but still useful):
Caught exception: An error occurred (InvalidSequenceTokenException) when calling the PutLogEvents operation: The given sequenceToken is invalid[…] -or- Multiple agents might be sending log events to log stream[…] – You can't push logs from multiple log files to a single log stream. Update your configuration to push each log to a log stream-log group combination.
It could be that the Amazon CloudWatch agent itself is trying to upload the same file twice because you have duplicates in your config.json.
So first print all your log group / log stream pairs in your config.json with:
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq -r '.logs.logs_collected.files.collect_list[]|"\(.log_group_name) \(.log_stream_name)"'|sort
which should give an output similar to:
/tableauserver/apigateway apigateway_node5-0.log
/tableauserver/apigateway control_apigateway_node5-0.log
/tableauserver/appzookeeper appzookeeper-discovery_node5-1.log
...
/tableauserver/vizqlserver vizqlserver_node5-3.log
Then you can use uniq -d to find the duplicates in that list with:
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq -r '.logs.logs_collected.files.collect_list[]|"\(.log_group_name) \(.log_stream_name)"'|sort|uniq -d
# The list should be empty otherwise you have duplicates
If that command produces any output it means that you have duplicates in your config.json collect_list.
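If you do get duplicates, you can also list the entries behind a duplicated pair. This is only a sketch, assuming the collect_list entries use the standard file_path key; substitute GROUP and STREAM with a pair reported by uniq -d:
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json | jq -r --arg g "GROUP" --arg s "STREAM" '.logs.logs_collected.files.collect_list[] | select(.log_group_name == $g and .log_stream_name == $s) | .file_path'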
I personally think that cwagent itself should print the "offending" log group/log stream in the logs, so I opened an issue on the amazon-cloudwatch-agent GitHub page.

Debugging NdisTimedDataHang reported by Driver Verifier

I have enabled the NDIS/WIFI verification flag for my driver in Driver Verifier. This resulted in a BSOD for hitting the NdisTimedDataHang rule. When I analyzed the dump, I got:
DRIVER_VERIFIER_DETECTED_VIOLATION (c4)
Arguments:
Arg1: 000000000009200f, ID of the 'NdisTimedDataHang' rule that was violated.
Arg2: fffff806cd819200, A pointer to the string describing the violated rule condition.
Arg3: ffff87862606b110, Address of internal rule state (second argument to !ruleinfo).
Arg4: ffff87862606b240, Address of supplemental states (third argument to !ruleinfo).
When I ran !ndiskd.pendingnbls, I got the list of NBLs that were pending at the time the dump was taken. To figure out which NBL caused the violation, I tried the !ruleinfo command with the arguments from the analysis:
!ruleinfo 0x9200f 0xffff87862606b110 0xffff87862606b240
but WinDbg reported the error:
Failed to read the rule state (check the second argument).
What am I doing wrong? Is there any way to figure out which NBL failed to complete within 22 seconds, which is the requirement of the NdisTimedDataHang rule?

Solaris svcs command shows wrong status

I have freshly installed an application on Solaris 5.10. When checked through ps -ef | grep hyperic | grep agent, the processes are up and running. When I check the status with the svcs hyperic-agent command, the output shows that the agent is in maintenance mode. The application is working fine and I don't have any issues with it. Please help.
There are several reasons that lead to that behavior:
The starter (the start/exec property of the service) returned a status different from SMF_EXIT_OK (zero). In that case you may check the logs:
# svcs -x ssh
...
See: /var/svc/log/network-ssh:default.log
If you check the logs, you may see messages like the following, which mean that the start method failed or is incorrectly written:
[ Aug 11 18:40:30 Method "start" exited with status 96 ]
Another reason for such behavior is that the service faults while it is running (i.e. one of its processes dumps core or receives a kill signal, or all of its processes exit), as described here: https://blogs.oracle.com/lianep/entry/smf_5_fault_retry_models
The facility that actually provides SMF with this monitoring is System Contracts. You may determine the contract ID of an online service with svcs -v (the CTID field):
# svcs -vp svc:/network/smtp:sendmail
STATE NSTATE STIME CTID FMRI
online - Apr_14 68 svc:/network/smtp:sendmail
Apr_14 1679 sendmail
Apr_14 1681 sendmail
Then watch the events with ctwatch:
# ctwatch 68
CTID EVID CRIT ACK CTTYPE SUMMARY
68 28 crit no process contract empty
Then there are two options to handle that:
There is a real problem with the service, so it eventually faults. In that case, debug the application.
It is normal behavior for the service, so you should edit and re-import your service manifest to make SMF less strict, i.e. configure the ignore_error and duration properties (see the sketch below).
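A minimal sketch of setting those properties from the shell, assuming the service FMRI is hyperic-agent as in the question; the values are only examples, and the addpg step is only needed if the startd property group does not exist yet:
# svccfg -s hyperic-agent addpg startd framework
# svccfg -s hyperic-agent setprop startd/ignore_error = astring: core,signal
# svccfg -s hyperic-agent setprop startd/duration = astring: contract
# svcadm refresh hyperic-agent
# svcadm restart hyperic-agent
Here ignore_error tells SMF not to treat core dumps or kill signals as faults, and duration controls how SMF decides whether the service is still running.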

NFS mount points are going off/NFS compound failed for server mashost

We have an application on Solaris. During a specific test case we generate a heap dump, which is written to the server at a specific path. During this test case we get the following error in the trace file:
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /ossrc/upgrade/JREheapdumps/java_pid16092.hprof ...
Dump file is incomplete: I/O error
and in /var/adm/messages we could see
Oct 28 13:00:10 ossuas2 nfs: [ID 733954 kern.info] NOTICE: [NFS4][Server: mashost][Mntpt: /ossrc/upgrade]NFS server mashost not
responding; still trying
Oct 28 13:02:53 ossuas2 nfs: [ID 733954 kern.info] NOTICE: [NFS4][Server: mashost][Mntpt: /usr/local]NFS server mashost not
responding; still trying
Oct 28 13:04:53 ossuas2 nfs: [ID 733954 kern.info] NOTICE: [NFS4][Server: mashost][Mntpt: /etc/opt/ericsson]NFS server mashost not
responding; still trying
Can anyone please help with why we are getting this problem, and can anyone tell us whether an application can cause this impact on mashost?
First things first, check out the NFS services with svcs -- when it crashes, run:
# svcs -x nfs/client
on the client, and
# svcs -x nfs/server
on the server. I would expect one or both to be in a "maintenance" state. (You may see it fails to start properly at all). If it is in a maintenance mode, you should see a row marked "Reason:" that says why.
You might see "offline" -- in that case, startd will attempt to reboot the service multiple times and, if it fails after five attempts or hangs indefinitely, places it into "maintenance" state and stops restarting.
Check the logs in
/var/svc/log/<service-name FMRI>.log
There will be one on your client machine under "network-nfs-client:default" (probably; it may have a name other than 'default' if it has been changed manually), and one on the server under "network-nfs-server:default".
See what you can glean from those.
SMF automatically takes snapshots of service configurations as backups, so you can try reverting to one of those.
# svccfg -s nfs/server:default
svc:/network/nfs/server:default> listsnap
svc:/network/nfs/server:default> revert [name_of_snapshot]
svc:/network/nfs/server:default> quit
# svcadm refresh nfs/server:default
# svcadm restart nfs/server:default
Make sure to include the ":default" tag, or, if you saw a different tag from "svcs nfs/server", include that one instead. That name defines an instance of the service; every running service is an instance.
If the process is failing to start, you might have to look at the XML manifest under /lib/svc/manifest/network/nfs/ -- inside, you'll see dependencies (and services dependent on this one), followed by the exec_method entries, which define how the service starts, stops and restarts.
Instead of snapshots, you can also restore it to the default configuration: use svccfg -s <FMRI> delete to clear it, then svcadm refresh <FMRI> and svcadm enable <FMRI>.
If the service is in maintenance state, once you've isolated and fixed the problem, you can manually clear that state by running svcadm clear <FMRI>.
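As a minimal sketch of that recovery sequence once the underlying problem is fixed (assuming it was the server-side NFS service that went into maintenance; the client side works the same way with nfs/client):
# svcs -x nfs/server
# svcadm clear nfs/server:default
# svcs nfs/server:default
The last svcs call should now report the service as online.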

PHP - sometimes timeouts while accessing Magento via SOAP

I've got a simple script which should:
1. Get all customers from Magento (into an array)
2. Get their full addresses (iterate through the array with a foreach)
2a. sleep for 3 sec
3. Get their order history (same foreach)
3a. sleep once more for 3 sec
I'm doing this from the command line with PHP CLI.
The script runs for a couple of minutes - sometimes even half an hour - but most often it hits an error and can't fully iterate through all the data:
PHP Fatal error: Uncaught SoapFault exception: [HTTP] Could not connect to host in /var/www/soap/mag_crm.php:162
Stack trace:
#0 [internal function]: SoapClient->__doRequest('<?xml version="...', 'http://myurl...', 'urn:Mage_Api_Mo...', 1, 0)
#1 /var/www/soap/mag_crm.php(162): SoapClient->__call('salesOrderList', Array)
#2 /var/www/soap/mag_crm.php(162): SoapClient->salesOrderList('fd66fc18e4b8...', Array)
#3 /var/www/soap/mag_crm.php(85): fetchAllOrders(259)
#4 {main}
thrown in /var/www/soap/mag_crm.php on line 162
How could I improve this script, or, in the case of an error like this, simply retry?
The code itself is simple - just plain function calls like this, for instance:
$aOrders = fetchAllOrders ( $oCustomer->customer_id );
(within the foreach)
You need to change your Magento settings as follows:
System -> Index Management
Check All Items
Change Action to "Change Index Mode"
Select "Manual"
Save
The problem is that the site tries to reindex all the time, which kills the session.
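Independently of the indexer settings, wrapping the SOAP calls in a small retry helper is a common way to survive intermittent "Could not connect to host" faults. This is only a sketch (the helper name, attempt count and delay are arbitrary; fetchAllOrders and $oCustomer are taken from the question):
<?php
// Generic retry helper for flaky SOAP calls; rethrows after the last attempt.
function retrySoapCall($call, $maxAttempts = 3, $delaySeconds = 5)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return $call();
        } catch (SoapFault $e) {
            if ($attempt === $maxAttempts) {
                throw $e;                      // out of retries
            }
            sleep($delaySeconds * $attempt);   // simple linear back-off
        }
    }
}

// Usage inside the existing foreach, wrapping the call from the question:
// $aOrders = retrySoapCall(function () use ($oCustomer) {
//     return fetchAllOrders($oCustomer->customer_id);
// });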