Simple distributed index: precached 0 indexes - sphinx

I have two simple indexes:
First, 01.conf:
searchd
{
listen = 9301
listen = 9401:mysql41
pid_file = /var/run/sphinxsearch/searchd01.pid
log = /var/log/sphinxsearch/searchd01.log
query_log = /var/log/sphinxsearch/query01.log
binlog_path = /var/lib/sphinxsearch/data/test/01
}
source base
{
type = mysql
sql_host = localhost
sql_db = test
sql_user = root
sql_pass = toor
sql_query_pre = SET NAMES utf8
sql_attr_uint = group_id
}
source test : base
{
sql_query = \
SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
FROM documents WHERE id % 2 = 0
}
index test
{
source = test
path = /var/lib/sphinxsearch/data/test/01
}
The second config looks like the first, but with "02" instead of "01" in the filename and inside.
And the distributed index in 00.conf:
searchd
{
listen = 9305
listen = 9405:mysql41
pid_file = /var/run/sphinxsearch/searchd00.pid
log = /var/log/sphinxsearch/searchd00.log
query_log = /var/log/sphinxsearch/query00.log
binlog_path = /var/lib/sphinxsearch/data/test
}
index test
{
type = distributed
agent = 127.0.0.1:9301:test
agent = 127.0.0.1:9302:test
}
And I try to use the distributed index:
sudo searchd --config /etc/sphinxsearch/d/00.conf --stop
sudo searchd --config /etc/sphinxsearch/d/01.conf --stop
sudo searchd --config /etc/sphinxsearch/d/02.conf --stop
sudo searchd --config /etc/sphinxsearch/d/01.conf
sudo searchd --config /etc/sphinxsearch/d/02.conf
sudo indexer --all --rotate --config /etc/sphinxsearch/d/01.conf
sudo indexer --all --rotate --config /etc/sphinxsearch/d/02.conf
sudo searchd --config /etc/sphinxsearch/d/00.conf
Unfortunately I get the following output:
...
using config file '/etc/sphinxsearch/d/00.conf'...
listening on all interfaces, port=9305
listening on all interfaces, port=9405
precached 0 indexes in 0.000 sec
Why?
And when I try to search something via the distributed index (port 9305), I get:
no enabled local indexes to search.
The MySQL-backed indexes work perfectly if I use them on ports 9301 and 9302 respectively, but searching the distributed index returns nothing.
UPDATE
# tail /var/log/sphinxsearch/searchd00.log
[Thu Sep 29 23:43:04.599 2016] [ 2353] binlog: finished replaying /var/lib/sphinxsearch/data/test/binlog.001; 0.0 MB in 0.000 sec
[Thu Sep 29 23:43:04.599 2016] [ 2353] binlog: finished replaying total 4 in 0.000 sec
[Thu Sep 29 23:43:04.599 2016] [ 2353] accepting connections
[Thu Sep 29 23:43:24.336 2016] [ 2353] caught SIGTERM, shutting down
[Thu Sep 29 23:43:24.472 2016] [ 2353] shutdown complete
[Thu Sep 29 23:43:24.473 2016] [ 2352] watchdog: main process 2353 exited cleanly (exit code 0), shutting down
[Thu Sep 29 23:43:24.634 2016] [ 2404] watchdog: main process 2405 forked ok
[Thu Sep 29 23:43:24.635 2016] [ 2405] listening on all interfaces, port=9305
[Thu Sep 29 23:43:24.635 2016] [ 2405] listening on all interfaces, port=9405
[Thu Sep 29 23:43:24.636 2016] [ 2405] accepting connections
UPDATE2
Hmm... It seems the problem is in how I query data from Sphinx. I also renamed the distributed index to test1. The following works well:
# mysql -h 127.0.0.1 -P 9405
mysql> select * from test1 where match ('one|two');
+------+----------+
| id | group_id |
+------+----------+
| 1 | 1 |
| 2 | 1 |
+------+----------+
2 rows in set (0,00 sec)
I think the problem was the old version of sphinxapi.php that I used.

precached 0 indexes in 0.000 sec
Well, that in itself is normal. There are no local indexes to 'precache'; a distributed index has no index files to 'load' or (pre)cache.
... but searchd should still be running at the end of that; it should start up OK.
Also try checking
/var/log/sphinxsearch/searchd00.log
which might have some more detail.
Although I suppose it's possible Sphinx will not start up without any real indexes (i.e. you can't have JUST a distributed index), so you could just add a stub local index to that config.
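For example, here is a minimal sketch of such a stub, assuming an RT index is acceptable (RT indexes need no indexer run; the index name "stub" and its path are made up for illustration):
index stub
{
type = rt
path = /var/lib/sphinxsearch/data/test/stub
rt_field = title
rt_attr_uint = group_id
}
With something like that in 00.conf, searchd has a local index to load at startup, while the distributed index keeps forwarding queries to the agents on 9301 and 9302.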

Barman geo-redundancy configuration error

I am trying to configure geo-redundancy on my Barman setup, but I get an error when I try to copy from the primary to the secondary backup. My configuration is:
SERVER 1
Ubuntu 18 on Virtual Box
Postgres 12
192.168.0.103
/etc/barman.conf
backup_method = rsync
archiver = on
compression = gzip
reuse_backup = link
backup_options = concurrent_backup
parallel_jobs = 2
network_compression = true
basebackup_retry_times = 20
basebackup_retry_sleep = 120
/etc/barman.d/comauto_20200921_95
[comauto_20200921_95]
description = "Comauto Local Postgres 9.5 - 21/09/2020"
conninfo = host=192.168.0.102 user=barman dbname=postgres
ssh_command = ssh postgres@192.168.0.102
retention_policy = RECOVERY WINDOW OF 2 WEEKS
backup_options = exclusive_backup
SERVER 2
Ubuntu 18 on Virtual Box
Postgres 9.5
ifconfig = 192.168.0.102
/etc/barman.conf
backup_method = rsync
archiver = on
compression = gzip
reuse_backup = link
backup_options = concurrent_backup
parallel_jobs = 2
network_compression = true
basebackup_retry_times = 20
basebackup_retry_sleep = 120
; the only difference
primary_ssh_command = barman@192.168.0.103
/etc/barman.d/comauto_20200921_95
[comauto_20200921_95]
description = "Comauto Local Postgres 9.5 - 21/09/2020"
conninfo = host=192.168.0.102 user=barman dbname=postgres
ssh_command = ssh postgres@192.168.0.102
retention_policy = RECOVERY WINDOW OF 2 WEEKS
backup_options = exclusive_backup
On server 1:
sudo su barman
ssh barman@192.168.0.102 -C true
# OK
barman check comauto_20200921_95
# All OK
barman backup comauto_20200921_95
# OK
barman list-backup comauto_20200921_95
# comauto_20200921_95 20201111T172643 - Wed Nov 11 17:26:50 2020 - Size: 6.2 GiB - WAL Size: 0 B
# comauto_20200921_95 20201111T114656 - Wed Nov 11 11:47:08 2020 - Size: 6.2 GiB - WAL Size: 79.9 KiB
# comauto_20200921_95 20201111T112906 - Wed Nov 11 11:33:10 2020 - Size: 6.2 GiB - WAL Size: 96.4 KiB
The error happens here
On server 2:
sudo su barman
ssh barman@192.168.0.103 -C true
# OK
barman check comauto_20200921_95
# WAL archive: FAILED
# ssh: FAILED (Connection failed using 'barman@192.168.0.103 -o BatchMode=yes -o StrictHostKeyChecking=no' return code 127)
barman cron
# ERROR: Failed to retrieve the primary node status: sync-info execution on remote primary server comauto_20200921_95 failed: /bin/sh: 1: barman@192.168.0.103: not found
barman list-backup comauto_20200921_95
#
One obvious issue is that primary_ssh_command is missing the actual ssh; it should presumably be:
; the only difference
primary_ssh_command = ssh barman@192.168.0.103
See example in the documentation here: https://docs.pgbarman.org/#configuration-1
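For reference, a sketch of the corrected stanza on server 2, reusing the names from the question:
; the only difference
primary_ssh_command = ssh barman@192.168.0.103
After that change, the checks from the question (ssh barman@192.168.0.103 -C true, barman check comauto_20200921_95 and barman cron, run as barman on server 2) should no longer fail with "return code 127" or "not found", since /bin/sh is now handed a real ssh command.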

Telegraf inputs.tail with zimbra.log

I have some questions: how can I set up the telegraf.conf file to collect logs from the "zimbra.log" file?
I tried the config below, but it does not work :(((
I want to send these logs to Grafana.
One of the lines of "zimbra.log", for example:
Oct 1 10:20:46 webmail postfix/smtp[7677]: BD5BAE9999: to=user@mail.com, relay=mo94.cloud.mail.com[92.97.907.14]:25, delay=0.73, delays=0.09/0.01/0.58/0.19, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 4C25fk2pjFz32N5)
And I do not understand exactly how "grok_patterns =" works.
[[inputs.tail]]
files = ["/var/log/zimbra.log"]
from_beginning = false
grok_patterns = ['%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST} %{DATA:program}(?:\[%{POSINT}\])?: %{GREEDYDATA:message}']
name_override = "zimbra_access_log"
grok_custom_pattern_files = []
grok_custom_patterns = '''
TS_UNIX %{MONTH}%{SPACE}%{MONTHDAY}%{SPACE}%{HOUR}:%{MINUTE}:%{SECOND}
TS_CUSTOM %{MONTH}%{SPACE}%{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND}
'''
grok_timezone = "Local"
data_format = "grok"
I have copied your example line into a log file called Prueba.txt which contains the following lines:
Oct 3 00:52:32 webmail postfix/smtp[7677]: BD5BAE9999: to=user@mail.com, relay=mo94.cloud.mail.com[92.97.907.14]:25, delay=0.73, delays=0.09/0.01/0.58/0.19, dsn=2.0.0, status=sent (250 2.0$
Oct 13 06:25:01 webmail systemd-logind[949]: New session 229478 of user zimbra.
Oct 13 06:25:02 webmail zmconfigd[27437]: Shutting down. Received signal 15
Oct 13 06:25:02 webmail systemd-logind[949]: Removed session c296.
Oct 13 06:25:03 webmail sshd[28005]: Failed password for invalid user julianne from 120.131.2.210 port 10570 ssh2
I have been able to parse the data with this configuration of the tail.input plugin:
[[inputs.tail]]
files = ["Prueba.txt"]
from_beginning = true
data_format = "grok"
grok_patterns = ['%{TIMESTAMP_ZIMBRA} %{GREEDYDATA:source} %{DATA:program}(?:\[%{POSINT}\])?: %{GREEDYDATA:message}']
grok_custom_patterns = '''
TIMESTAMP_ZIMBRA (\w{3} \d{1,2} \d{2}:\d{2}:\d{2})
'''
name_override = "log_frames"
You need to match the input string with regular expressions. For that there are some predefined patterns, such as GREEDYDATA = .*, that you can use to match your input (other examples are NUMBER = (?:%{BASE10NUM}) and BASE16NUM = (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+))). You can also define your own patterns in grok_custom_patterns. Take a look at this page with some patterns: https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Apx-GrokPatterns/GrokPatterns_title.html
In this case I defined a TIMESTAMP_ZIMBRA pattern for matching inputs like Oct 3 00:52:32 and Oct 03 00:52:33 alike.
Here is the metric as collected by Prometheus:
# HELP log_frames_delay Telegraf collected metric
# TYPE log_frames_delay untyped
log_frames_delay{delays="0.09/0.01/0.58/0.19",dsn="2.0.0",host="localhost.localdomain",message="BD5BAE9999:",path="Prueba.txt",program="postfix/smtp",relay="mo94.cloud.mail.com[92.97.907.14]:25",source="webmail",status="sent (250 2.0.0 Ok: queued as 4C25fk2pjFz32N5)",to="user@mail.com"} 0.73
P.S.: Ensure that Telegraf has access to the log files.
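If you want to inspect the parsed fields locally before wiring this up to Grafana/Prometheus, one option (a sketch, not the only way) is to add the standard file output writing to stdout next to the [[inputs.tail]] block above:
[[outputs.file]]
files = ["stdout"]
data_format = "influx"
Running Telegraf with that config prints each parsed line in InfluxDB line protocol, so you can confirm that fields such as source, program and message are extracted as expected.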

Audit daemon does not take rules from audit.rules

I am unable to add rules to audit daemon using /etc/audit/audit.rules
Every time I add rules using auditctl, they get removed on reboot or when the audit daemon restarts. I have attached /etc/audit/audit.rules and /etc/audit/auditd.conf.
$ cat /etc/audit/auditd.conf
#
# This file controls the configuration of the audit daemon
#
local_events = yes
write_logs = yes
log_file = /NU_Application/audit.log
log_group = root
log_format = RAW
flush = INCREMENTAL_ASYNC
freq = 50
max_log_file = 8
num_logs = 5
priority_boost = 4
disp_qos = lossy
dispatcher = /sbin/audispd
name_format = NONE
##name = mydomain
max_log_file_action = ROTATE
space_left = 75
space_left_action = SYSLOG
verify_email = yes
action_mail_acct = root
admin_space_left = 50
admin_space_left_action = SUSPEND
disk_full_action = SUSPEND
disk_error_action = SUSPEND
use_libwrap = yes
##tcp_listen_port = 22
tcp_listen_queue = 5
tcp_max_per_addr = 1
##tcp_client_ports = 1024-65535
tcp_client_max_idle = 0
enable_krb5 = no
krb5_principal = auditd
##krb5_key_file = /etc/audit/audit.key
distribute_network = no
$ cat /etc/audit/audit.rules
## First rule - delete all
## Increase the buffers to survive stress events.
## Make this bigger for busy systems
-b 8192
## This determine how long to wait in burst of events
--backlog_wait_time 0
## Set failure mode to syslog
-f 1
-w /var/log/lastlog -p wa
root@iWave-G22M:~# auditctl
When I restart the audit daemon (i.e. /etc/init.d/auditd restart) and try to list the rules, I get the message No rules.
$ /etc/init.d/auditd restart
Restarting audit daemon auditd
type=1305 audit(1558188111.980:3): audit_pid=0 old=1148 auid=4294967295 ses=4294967295
res=1
type=1305 audit(1558188112.010:4): audit_enabled=1 old=1 auid=4294967295 ses=4294967295
res=1
type=1305 audit(1558188112.020:5): audit_pid=30342 old=0 auid=4294967295 ses=4294967295
res=1
1
$ auditctl -l
No rules
OS INFO
$ uname -a
Linux iWave-G22M 3.10.31-ltsi-svn743 #5 SMP PREEMPT Mon May 27 18:28:01 IST 2019 armv7l GNU/Linux
The audit_2.8.4.bb recipe was used to install the auditd daemon via Yocto.
path of audit_2.8.4.bb -- http://git.yoctoproject.org/cgit/cgit.cgi/meta-selinux/tree/recipes-security/audit/audit_2.8.4.bb?h=master
Audit rules added via /etc/audit/audit.rules and the auditctl command are not permanent. To make them permanent across reboots, you have to add them to the /etc/audit/rules.d/audit.rules file.
After adding the rule, restart the auditd service and run auditctl -l; it will list all the rules, and they will also be reflected in the /etc/audit/audit.rules file.
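A sketch of that sequence using the rule from the question (run as root; the paths are the standard audit ones):
echo '-w /var/log/lastlog -p wa' >> /etc/audit/rules.d/audit.rules
/etc/init.d/auditd restart
auditctl -l
The last command should now list the -w /var/log/lastlog -p wa rule, and the regenerated /etc/audit/audit.rules should contain it as well.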

Cocoapods pod repo push git

I am trying to push my pod to a local repo. Before that, I verified pod lib lint on my repo, and it works fine locally:
$ pod lib lint --swift-version=5.0 --allow-warnings
/System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/universal-darwin18/rbconfig.rb:215: warning: Insecure world writable dir /usr/local/sbin in PATH, mode 040777
-> SFLocationManager (1.0)
- WARN | source: Git SSH URLs will NOT work for people behind firewalls configured to only allow HTTP, therefore HTTPS is preferred.
- NOTE | xcodebuild: note: Using new build system
- NOTE | xcodebuild: note: Planning build
- NOTE | xcodebuild: note: Constructing build description
SFLocationManager passed validation.
After this, I created tags and pushed them to the server:
$ git tag
0.1.0
0.1.1
1.0
Then I tried the pod repo push command for the local repo, which failed:
$ pod repo push git@git.url.com:ankit.thakur/locationmanager.git SFLocationManager.podspec --allow-warnings --swift-version=5.0 --local-only
/System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/universal-darwin18/rbconfig.rb:215: warning: Insecure world writable dir /usr/local/sbin in PATH, mode 040777
Validating spec
-> SFLocationManager (1.0)
- WARN | source: Git SSH URLs will NOT work for people behind firewalls configured to only allow HTTP, therefore HTTPS is preferred.
- ERROR | file patterns: The `source_files` pattern did not match any file.
- NOTE | xcodebuild: note: Using new build system
- NOTE | xcodebuild: note: Planning build
- NOTE | xcodebuild: note: Constructing build description
[!] The `SFLocationManager.podspec` specification does not validate.
Then I removed the --local-only flag and ran it again, but it still failed.
$ pod repo push git@git.url.com:ankit.thakur/locationmanager.git SFLocationManager.podspec --allow-warnings --swift-version=5.0
/System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/universal-darwin18/rbconfig.rb:215: warning: Insecure world writable dir /usr/local/sbin in PATH, mode 040777
Validating spec
-> SFLocationManager (1.0)
- WARN | source: Git SSH URLs will NOT work for people behind firewalls configured to only allow HTTP, therefore HTTPS is preferred.
- ERROR | file patterns: The `source_files` pattern did not match any file.
- NOTE | xcodebuild: note: Using new build system
- NOTE | xcodebuild: note: Planning build
- NOTE | xcodebuild: note: Constructing build description
[!] The `SFLocationManager.podspec` specification does not validate.
Here is the pod version
$ pod --version
/System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/universal-darwin18/rbconfig.rb:215: warning: Insecure world writable dir /usr/local/sbin in PATH, mode 040777
1.6.0
Here is the podspec file:
#
# Be sure to run `pod lib lint SFLocationManager.podspec' to ensure this is a
# valid spec before submitting.
#
# Any lines starting with a # are optional, but their use is encouraged
# To learn more about a Podspec see https://guides.cocoapods.org/syntax/podspec.html
#
Pod::Spec.new do |spec|
spec.name = 'SFLocationManager'
spec.version = '1.0'
spec.summary = 'SFLocationManager is location based library for iOS and Mac'
# This description is used to generate tags and improve search results.
# * Think: What does it do? Why did you write it? What is the focus?
# * Try to keep it short, snappy and to the point.
# * Write the description between the DESC delimiters below.
# * Finally, don't worry about the indent, CocoaPods strips it!
spec.description = <<-DESC
Location library in beta test version to fetch location with scheduled interval.
DESC
spec.homepage = 'https://git.url.com/ankit.thakur/locationmanager'
# spec.screenshots = 'www.example.com/screenshots_1', 'www.example.com/screenshots_2'
spec.license = { :type => 'MIT', :file => 'LICENSE' }
spec.author = { 'ankitthakur' => 'ankit.thakur@url.com' }
spec.source = { :git => 'git@git.url.com:ankit.thakur/locationmanager.git', :tag => spec.version.to_s }
# spec.social_media_url = 'https://twitter.com/<TWITTER_USERNAME>'
spec.requires_arc = true
spec.ios.deployment_target = '10.0'
spec.osx.deployment_target = '10.10'
spec.source_files = 'SFLocationManager/Sources/Common/**/*.swift'
# spec.ios.source_files = 'SFLocationManager/Sources/iOS/**/*.swift'
# spec.osx.source_files = 'SFLocationManager/Sources/OSX/**/*.swift'
# spec.resource_bundles = {
# 'SFLocationManager' => ['SFLocationManager/Assets/*.png']
# }
spec.frameworks = 'CoreLocation'
# spec.public_header_files = 'Pod/Classes/**/*.h'
# spec.frameworks = 'UIKit', 'MapKit'
# spec.dependency 'AFNetworking', '~> 2.3'
end
The files matching the spec.source_files pattern are:
$ ls -al SFLocationManager/Sources/Common/**/*.swift
-rw-r--r--@ 1 ankitthakur staff 2710 Apr 25 18:02 SFLocationManager/Sources/Common/GeocoderUtils/Geocoder.swift
-rw-r--r--@ 1 ankitthakur staff 613 Apr 25 18:21 SFLocationManager/Sources/Common/LocationManager/LocationConfiguration.swift
-rw-r--r--@ 1 ankitthakur staff 324 Apr 25 18:02 SFLocationManager/Sources/Common/LocationManager/LocationError.swift
-rw-r--r--@ 1 ankitthakur staff 241 Apr 25 18:02 SFLocationManager/Sources/Common/LocationManager/LocationEventType.swift
-rw-r--r--@ 1 ankitthakur staff 7144 Apr 25 18:36 SFLocationManager/Sources/Common/LocationManager/LocationManager.swift
-rw-r--r--@ 1 ankitthakur staff 4649 Apr 25 18:02 SFLocationManager/Sources/Common/Model/Location.swift
-rw-r--r--@ 1 ankitthakur staff 3939 Apr 25 18:27 SFLocationManager/Sources/Common/Trigger/LocationTriggerManager.swift
As per the suggestions in the provided solutions, my updated podspec is:
#
# Be sure to run `pod lib lint SFLocationManager.podspec' to ensure this is a
# valid spec before submitting.
#
# Any lines starting with a # are optional, but their use is encouraged
# To learn more about a Podspec see https://guides.cocoapods.org/syntax/podspec.html
#
Pod::Spec.new do |spec|
spec.name = 'SFLocationManager'
spec.version = '1.0'
spec.summary = 'SFLocationManager is location based library for iOS and Mac'
# This description is used to generate tags and improve search results.
# * Think: What does it do? Why did you write it? What is the focus?
# * Try to keep it short, snappy and to the point.
# * Write the description between the DESC delimiters below.
# * Finally, don't worry about the indent, CocoaPods strips it!
spec.description = <<-DESC
Location library in beta test version to fetch location with scheduled interval.
DESC
spec.homepage = 'https://git.promobitech.com/ankit.thakur/locationmanager'
# spec.screenshots = 'www.example.com/screenshots_1', 'www.example.com/screenshots_2'
spec.license = { :type => 'MIT', :file => 'LICENSE' }
spec.author = { 'ankitthakur' => 'ankit.thakur@promobitech.com' }
spec.source = { :git => 'git@git.promobitech.com:ankit.thakur/locationmanager.git', :tag => spec.version.to_s }
# spec.social_media_url = 'https://twitter.com/<TWITTER_USERNAME>'
spec.requires_arc = true
spec.ios.deployment_target = '10.0'
spec.osx.deployment_target = '10.10'
spec.source_files = 'SFLocationManager/Sources/Common/GeocoderUtils/*.{swift}',
'SFLocationManager/Sources/Common/LocationManager/*.{swift}',
'SFLocationManager/Sources/Common/Model/*.{swift}',
'SFLocationManager/Sources/Common/Trigger/*.{swift}'
# spec.ios.source_files = 'SFLocationManager/Sources/iOS/**/*.{swift}'
# spec.osx.source_files = 'SFLocationManager/Sources/OSX/**/*.{swift}'
# spec.resource_bundles = {
# 'SFLocationManager' => ['SFLocationManager/Assets/*.png']
# }
spec.frameworks = 'CoreLocation'
# spec.public_header_files = 'Pod/Classes/**/*.h'
# spec.frameworks = 'UIKit', 'MapKit'
# spec.dependency 'AFNetworking', '~> 2.3'
end
but it is still not working.
Here is the directory containing my podspec file:
Admin:locationmanager ankitthakur$ ls -al
total 40
drwxr-xr-x 10 ankitthakur staff 320 Apr 25 20:38 .
drwxr-xr-x 9 ankitthakur staff 288 Apr 25 20:38 ..
-rw-r--r-- 1 ankitthakur staff 6148 Apr 25 20:38 .DS_Store
drwxr-xr-x 14 ankitthakur staff 448 Apr 26 14:50 .git
drwxr-xr-x 10 ankitthakur staff 320 Apr 25 20:38 Example
-rw-r--r-- 1 ankitthakur staff 1086 Apr 25 20:38 LICENSE
-rw-r--r-- 1 ankitthakur staff 1029 Apr 25 20:38 README.md
drwxr-xr-x 4 ankitthakur staff 128 Apr 25 20:51 SFLocationManager
-rw-r--r-- 1 ankitthakur staff 2241 Apr 26 14:49 SFLocationManager.podspec
lrwxr-xr-x 1 ankitthakur staff 27 Apr 25 20:38 _Pods.xcodeproj -> Example/Pods/Pods.xcodeproj
The error says:
file patterns: The `source_files` pattern did not match any file.
This means that you have written an incorrect pattern.
So you should correct your source_files like the following:
s.source_files = "FOLDERNAME/*.{swift}"
(This will include all the Swift files under the folder "FOLDERNAME")
If you have multiple folders, do it like the following:
s.source_files = "FOLDERNAME1/*.{swift}" , "FOLDERNAME2/*.{swift}"

TORQUE jobs hang on Ubuntu 16.04

I have TORQUE installed on Ubuntu 16.04, and I am having trouble because my jobs hang. I have a test script test.pbs:
#PBS -N test
#PBS -l nodes=1:ppn=1
#PBS -l walltime=0:01:00
cd $PBS_O_WORKDIR
touch done.txt
echo "done"
And I run it with
qsub test.pbs
The job writes done.txt and echoes "done" just fine, but the job hangs in the C state.
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
46.localhost test wlandau 00:00:00 C batch
Edit: some diagnostic info on another job from qstat -f 55
qstat -f 55
Job Id: 55.localhost
Job_Name = test
Job_Owner = wlandau@localhost
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
job_state = C
queue = batch
server = haggunenon
Checkpoint = u
ctime = Mon Oct 30 07:35:00 2017
Error_Path = localhost:/home/wlandau/Desktop/test.e55
exec_host = localhost/2
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon Oct 30 07:35:00 2017
Output_Path = localhost:/home/wlandau/Desktop/test.o55
Priority = 0
qtime = Mon Oct 30 07:35:00 2017
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 00:01:00
session_id = 5115
Variable_List = PBS_O_QUEUE=batch,PBS_O_HOST=localhost,
PBS_O_HOME=/home/wlandau,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=wlandau,
PBS_O_PATH=/home/wlandau/bin:/home/wlandau/.local/bin:/usr/local/sbin
:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/ga
mes:/snap/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=localhost,
PBS_O_WORKDIR=/home/wlandau/Desktop
comment = Job started on Mon Oct 30 at 07:35
etime = Mon Oct 30 07:35:00 2017
exit_status = 0
submit_args = test.pbs
start_time = Mon Oct 30 07:35:00 2017
Walltime.Remaining = 60
start_count = 1
fault_tolerant = False
comp_time = Mon Oct 30 07:35:00 2017
And a similar tracejob -n2 62:
/var/spool/torque/server_priv/accounting/20171029: No matching job records located
/var/spool/torque/server_logs/20171029: No matching job records located
/var/spool/torque/mom_logs/20171029: No matching job records located
/var/spool/torque/sched_logs/20171029: No matching job records located
Job: 62.localhost
10/30/2017 17:20:25 S enqueuing into batch, state 1 hop 1
10/30/2017 17:20:25 S Job Queued at request of wlandau@localhost, owner =
wlandau@localhost, job name = jobe945093c2e029c5de5619d6bf7922071,
queue = batch
10/30/2017 17:20:25 S Job Modified at request of Scheduler@Haggunenon
10/30/2017 17:20:25 S Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb
resources_used.vmem=0kb resources_used.walltime=00:00:00
10/30/2017 17:20:25 L Job Run
10/30/2017 17:20:25 S Job Run at request of Scheduler@Haggunenon
10/30/2017 17:20:25 S Not sending email: User does not want mail of this type.
10/30/2017 17:20:25 S Not sending email: User does not want mail of this type.
10/30/2017 17:20:25 M job was terminated
10/30/2017 17:20:25 M obit sent to server
10/30/2017 17:20:25 A queue=batch
10/30/2017 17:20:25 M scan_for_terminated: job 62.localhost task 1 terminated, sid=17917
10/30/2017 17:20:25 A user=wlandau group=wlandau
jobname=jobe945093c2e029c5de5619d6bf7922071 queue=batch
ctime=1509398425 qtime=1509398425 etime=1509398425 start=1509398425
owner=wlandau@localhost exec_host=localhost/0 Resource_List.ncpus=1
Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 Resource_List.walltime=01:00:00
10/30/2017 17:20:25 A user=wlandau group=wlandau
jobname=jobe945093c2e029c5de5619d6bf7922071 queue=batch
ctime=1509398425 qtime=1509398425 etime=1509398425 start=1509398425
owner=wlandau@localhost exec_host=localhost/0 Resource_List.ncpus=1
Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 Resource_List.walltime=01:00:00 session=17917
end=1509398425 Exit_status=0 resources_used.cput=00:00:00
resources_used.mem=0kb resources_used.vmem=0kb
resources_used.walltime=00:00:00
EDIT: jobs now hanging in E
After some tinkering, I am now using these settings. I have moved on to this tiny pipeline workflow, where some TORQUE jobs wait for other TORQUE jobs to finish. Unfortunately, all the jobs hang in the E state, and any jobs beyond 4 just stay queued. To keep things from hanging indefinitely, I have to sudo qdel -p each one, which I think is causing real problems with the project's filesystem, as well as being an inconvenience.
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
113.localhost ...b73ec2cda6dca wlandau 00:00:00 E batch
114.localhost ...b6c8e6da05983 wlandau 00:00:00 E batch
115.localhost ...9123b8e20850b wlandau 00:00:00 E batch
116.localhost ...e6d49a3d7d822 wlandau 00:00:00 E batch
117.localhost ...8c3f6cb68927b wlandau 0 Q batch
118.localhost ...40b1d0cab6400 wlandau 0 Q batch
qmgr -c "list server" shows
Server haggunenon
server_state = Active
scheduling = True
max_running = 300
total_jobs = 5
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:1 Exiting:3
acl_hosts = localhost
managers = root#localhost
operators = root#localhost
default_queue = batch
log_events = 511
mail_from = adm
query_other_jobs = True
resources_assigned.ncpus = 4
resources_assigned.nodect = 4
scheduler_iteration = 600
node_check_rate = 150
tcp_timeout = 6
mom_job_sync = True
pbs_version = 2.4.16
keep_completed = 0
submit_hosts = SERVER
allow_node_submit = True
next_job_number = 119
net_counter = 118 94 93
And qmgr -c "list queue batch" shows
Queue batch
queue_type = Execution
total_jobs = 5
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:4
max_running = 300
resources_max.ncpus = 4
resources_max.nodes = 2
resources_min.ncpus = 1
resources_default.ncpus = 1
resources_default.nodect = 1
resources_default.nodes = 1
resources_default.walltime = 01:00:00
mtime = Wed Nov 1 07:40:45 2017
resources_assigned.ncpus = 4
resources_assigned.nodect = 4
keep_completed = 0
enabled = True
started = True
The C state means the job has completed and its status is kept in the system. Usually the status is kept after job completion for a period of time specified by the keep_completed parameter. However, certain types of failure may result in the job being kept in this state to provide the information necessary to examine the cause of failure.
Check the output of qstat -f 46 to see if there is anything indicating an error.
To tune the keep_completed parameter, you can execute the following command to check its value on your system:
qmgr -c "print queue batch keep_completed"
If you have administrative privileges on the TORQUE server, you can also change this value with:
qmgr -c "set queue batch keep_completed=120"
This keeps jobs in the completed state for 2 minutes after completion.
In general having keep_completed set is a useful feature. Advanced schedulers use the information on completed jobs to schedule around failures.
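As a concrete check (a sketch; the queue name and test.pbs are the ones from the question), after raising keep_completed you can watch a finished job age out of the queue listing:
qmgr -c "set queue batch keep_completed=120"
qsub test.pbs
qstat
The job should show up as C once it finishes, and roughly two minutes later it should disappear from qstat entirely; if it stays in C much longer than keep_completed, that points to the kind of failure described above.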