How to find the process behind a dmesg message in Solaris? - solaris

There is an "Oracle Solaris 11.4" system where we are seeing a flood of messages in dmesg:
genunix: [ID 200113 kern.warning] WARNING: symlink creation failed,
error 2
This message appears every 15 minutes, but I couldn't find any crontab job that starts at that interval.
Is there a way to find out which process runs every 15 minutes?
Can I use DTrace or something else?
Thanks
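In case a sketch helps, this is roughly how it might be done with DTrace, assuming the kernel warning corresponds to a symlink(2) call failing with ENOENT on behalf of a user process (if the symlink is created purely inside the kernel, the fbt provider would be needed instead). The probe names may also need adjusting, e.g. to symlinkat, depending on which system call the process actually makes:

#!/usr/sbin/dtrace -s
/*
 * Sketch: report every process whose symlink(2) call fails with
 * errno 2 (ENOENT).  Run as root and leave it running until the
 * kernel warning shows up again.
 */
#pragma D option quiet

syscall::symlink:entry
{
        /* remember the user-space pointers to both path arguments */
        self->src = arg0;
        self->dst = arg1;
}

syscall::symlink:return
/self->dst && errno == 2/
{
        printf("%Y  pid %d  %s: symlink(\"%s\", \"%s\") -> ENOENT\n",
            walltimestamp, pid, execname,
            copyinstr(self->src), copyinstr(self->dst));
}

syscall::symlink:return
{
        /* clear the thread-local state after every call */
        self->src = 0;
        self->dst = 0;
}

Matching the timestamps it prints against the dmesg entries should identify whatever fires every 15 minutes.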

Related

Airflow example DAGs take a long time in GUI and the scheduler is showing this: DagFileProcessorManager (PID=...) last sent a heartbeat

While running standard Airflow examples with Airflow 2.1.2, DAGs are taking a long time to complete. The problem occurs on every DAG run when triggered from the Airflow GUI; it isn't a problem when running as a test from the Airflow command line. Looking at the scheduler log as it runs, this is what is apparent: after a task runs, the DagFileProcessorManager apparently has to be restarted before it continues to the next task, which takes 1 to 2 minutes. The restart happens after the absence of heartbeat responses, and this error shows:
{dag_processing.py:414} ERROR - DagFileProcessorManager (PID=67503) last sent a heartbeat 64.25 seconds ago! Restarting it
Question: How can I fix this?
This fixed the problem:
(1) Use postgresql instead of sqlite.
(2) Switch from SequentialExecutor to LocalExecutor.
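For illustration, both changes can be made in airflow.cfg or via environment variables; a minimal sketch for Airflow 2.1.x (the Postgres host, user, password and database name below are placeholders) might look like:

# Sketch only - adjust the connection string for your environment.
# These are the environment-variable equivalents of the [core] settings.
export AIRFLOW__CORE__EXECUTOR=LocalExecutor
export AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

# re-initialise the metadata database against Postgres, then restart the services
airflow db init
airflow scheduler &
airflow webserver &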
Just to add to that - we had other similar reports, and we decided to add a very clear warning for such cases in the UI (it will be released in the next version):
https://github.com/apache/airflow/pull/17133

Batch account node restarted unexpectedly

I am using an Azure Batch account to run sqlpackage.exe in order to move databases from one server to another. A task that started 6 days ago was suddenly restarted from the beginning after 4 days of running (extremely large databases). The task had run uninterrupted up until then and should have continued for about another 1-2 days.
The PowerShell script that contains all the logic handles all the exceptions that could occur during the execution. Also, the retry count for the task was set to 0 in case it fails.
Unfortunately, I did not have diagnostic settings configured, so I could only look at the metrics, which showed a short period when there wasn't any node.
What could cause this behavior, i.e. a restart while the node is still running?
Thanks
Unfortunately, there is no way to give a definitive answer to this question. You will need to dig into the compute node (interactively log in) and check system logs to give you details on why the node restarted. There is no guarantee that a compute node will have 100% uptime as there may be hardware faults or other service interruptions.
In general, it's best practice to have long running tasks checkpoint progress combined with a retry policy. Programs that can reload state can pick up at the time of the checkpoint when the Batch service automatically reschedules the task execution. Please see the Batch best practices guide for more information.
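As a rough illustration of that pattern (the actual task here is a PowerShell script, but the idea is the same; the checkpoint path and step commands below are hypothetical):

#!/bin/sh
# Sketch of a checkpoint-and-resume loop.  Combine it with a Batch retry
# count > 0 so a rescheduled task re-runs the script and skips work
# already recorded as complete.
CKPT=/mnt/batchwork/progress.ckpt

LAST=$(cat "$CKPT" 2>/dev/null || echo 0)   # last completed step, 0 on a fresh start

for STEP in 1 2 3 4 5; do
    if [ "$STEP" -le "$LAST" ]; then
        echo "step $STEP already done, skipping"
        continue
    fi
    echo "running step $STEP"               # replace with the real work,
    sleep 1                                 # e.g. one sqlpackage.exe run per database
    echo "$STEP" > "$CKPT"                  # persist progress after each step
done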

Incorrect failure notification from Rundeck during fall time change

Last night was the "fall back" time change for most locations in the US. I woke up this morning to find dozens of job failure notifications. Almost all of them, though, were incorrect: the jobs showed as having completed normally, yet Rundeck sent a failure notification for them.
Interestingly, this happened in two completely separate Rundeck installations (v2.10.8-1 and v3.1.2-20190927). The commonality is that they're both on CentOS 7 (separate servers). They're both using MariaDB, although different versions of MariaDB.
The failure emails for the jobs that finished successfully showed a negative time in the "Scheduled after" line:
#1,811,391 by admin - Scheduled after 59m at 1:19 AM
• Scheduled after -33s - View Output »
• Download Output
Execution
  User: admin
  Time: 59m
  Started: in 59m 2019-11-03 01:19:01.0
  Finished: 1s ago Sun Nov 03 01:19:28 EDT 2019
Executions
  Success rate: 100%
  Average duration: -45s
That job actually ran in 27s at 01:19 EDT (during the first pass through the 1 AM hour; we are now on EST). Looking at the email headers, I believe I got the message at 1:19 EST, an hour after the job ran.
So that would seem to imply to me that it's just a notification problem (somehow).
But a couple of jobs that follow other job executions failed as well, apparently because the successfully finished job they follow returned an RC of 2. I'm not sure what to make of this.
We've been running Rundeck for a few years now, and this is the first time I remember seeing this problem. Of course my memory may be faulty--maybe we did see it previously, only with fewer jobs affected or some such.
The fact that it impacted two different versions of Rundeck on two different servers implies either it's a fundamental issue with Rundeck that's been around for a while or it is something else in the operating system that's somehow causing problems for Rundeck. (Although time change isn't new, so that would seem to be somewhat surprising too.)
Any thoughts about what might have gone on (and how to prevent it next year, short of the obvious run on UTC) would be appreciated.
You can define a specific time zone in Rundeck; check this and this.
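For example, one common approach is to pin the Rundeck JVM to UTC so DST transitions never shift the scheduler's clock; a sketch for an RPM install (the file and variable names can differ between packaging and versions, e.g. RDECK_JVM vs. RDECK_JVM_OPTS):

# Pin the server JVM to UTC, then restart rundeckd.
echo 'RDECK_JVM_OPTS="-Duser.timezone=Etc/UTC"' >> /etc/sysconfig/rundeckd
systemctl restart rundeckd

Recent Rundeck versions also allow setting a time zone on the job schedule itself, which keeps individual jobs predictable across DST changes.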

SSIS Transfer Objects task fails when run from Agent

I am using the SSIS Transfer Objects task to transfer a database from one server to another. This is a nightly task as the final part of ETL.
If I run the task manually during the day, there is no problem. It completes in around 60 to 90 minutes.
When I run the task from SQL Server Agent, it always starts but often fails. I have the Agent step set up to retry on failure, but most nights it takes 3 attempts; on some nights, 5 or 6 attempts.
The error message returned is twofold (both error messages show in the log for the same row):
1) An error occurred while transferring data. See the inner exception for details.
2) Timeout expired: The timeout period elapsed prior to completion of the operation or the server is not responding
I can't find any timeout limit to adjust that I haven't already adjusted.
Anyone have any ideas?

Lustre hangs while mounting oss

I have installed the parallel file system Lustre from RPMs, following this slide.
I have set up nodes A and B.
I installed the MDS and MDT on node A; the mount was successful.
But after formatting the OSS on node B using mkfs.lustre and then mounting it, the mount hangs and waits indefinitely.
It reports this error every 120 seconds:
INFO: task mount.lustre:1541 blocked for more than 120 seconds.
Not tainted 2.6.32-504.8.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Why does this occur? Or can you point me to a better tutorial or share your experience? The Lustre version is 2.7.0.
Thanks a lot.
It is an info message. As the message itself says, you can echo 0 to hung_task_timeout_secs to stop it from appearing, but I would not recommend that.
Try lowering the dirty-page cache flush thresholds (by default around 40% and 10%) by setting vm.dirty_ratio=5 and vm.dirty_background_ratio=5 in /etc/sysctl.conf. Activate them with the sysctl -p command; there is no need to reboot the system.
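For example, a sketch of exactly those two settings:

# Append the lower dirty-page thresholds and apply them without rebooting.
cat >> /etc/sysctl.conf <<'EOF'
vm.dirty_ratio = 5
vm.dirty_background_ratio = 5
EOF
sysctl -p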