I am having some real problems with TIBCO RV: at exactly the same time, two machines reported DATALOSS on INBOUND. The third machine, the remote daemon, did not report any unusual errors. This is happening too often, even though I've manually changed the rx/tx buffer sizes on the network cards. I am not using CDM. Any ideas? Thanks
Machine 1
2013-01-08 12:58:21 /usr/tibco/tibrv/bin/realrvd64: TIB/Rendezvous Error: {ADV_CLASS="ERROR" ADV_SOURCE="SYSTEM" ADV_NAME="DATALOSS.INBOUND.BCAST" ADV_DESC="dataloss: remote daemon did not satisfy our retransmission request(s)" host="Machine 3" lost=3 scid=13001}
Machine 2
2013-01-08 12:58:21 /usr/tibco/tibrv/bin/realrvd64: TIB/Rendezvous Error: {ADV_CLASS="ERROR" ADV_SOURCE="SYSTEM" ADV_NAME="DATALOSS.INBOUND.BCAST" ADV_DESC="dataloss: remote daemon did not satisfy our retransmission request(s)" host="Machine 3" lost=3 scid=13001}
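For reference, this is how I have been checking and bumping the NIC ring buffers (a quick sketch; the interface name and sizes are placeholders for my actual hardware):
# show current and maximum rx/tx ring sizes
ethtool -g eth0
# raise the rings towards the hardware maximum
ethtool -G eth0 rx 4096 tx 4096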
(xpost from superuser with no answers.)
I am trying to reconfigure a known (virtual?) com port on multiple computers on a local network using a batch file.
A USB device we use is always installed as COM9 and always comes in at the default 9600 baud, and we have to manually reconfigure each station to 57600 baud.
I already have this batch file renaming printers, changing DNS servers, killing and starting tasks, copying files, and a whole lot more. I've experimented with mode, but I'm either not using it properly or it can't do what I want.
I know I can use the GUI, but for the sake of speed, I want the batch to do it.
Sorry if this is a duplicate, but I'm seeing if anyone has an angle for me. I'm not afraid of doing my own research, but I'm running into dead ends with no leads.
Ask if you need any clarifications, and thanks in advance.
PowerShell is okay too, as long as I know what I need and can still stay in the cmd environment.
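For what it's worth, the kind of one-liner I've been experimenting with in mode looks like this (COM9 matches our device; the 8-N-1 framing is an assumption on my part):
rem set the default parameters for COM9
mode COM9 BAUD=57600 PARITY=n DATA=8 STOP=1
rem query the port to verify the change
mode COM9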
I want to use PostgreSQL on Windows Server 2012 R2 for one of our projects, which needs 24/7 uptime.
I would like to ask the community whether I can have two master instances on two different servers, A and B, that both 'work' on the same DB located on shared file storage on the LAN. Only one master instance, on server A, will be online at a time; when it goes offline for some reason, a PowerShell script will (I suppose) recognize that the PostgreSQL service has stopped and will start the service on server B. The same script will continuously check that only one service across servers A and B is running, to avoid conflicts.
I'd like to ask whether this is possible, or whether there is a better approach for my configuration.
(I can't use replication, because when server A shuts down, server B stays in read-only mode, which I don't want.)
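To make the idea concrete, the check I have in mind would look roughly like this (only a sketch; the server names and the PostgreSQL service name are placeholders, and it assumes the script host can query services on both machines):
# placeholder names - adjust for the real environment
$primary = "SERVER-A"
$standby = "SERVER-B"
$svc     = "postgresql-x64-9.6"   # assumed Windows service name
$p = Get-Service -Name $svc -ComputerName $primary -ErrorAction SilentlyContinue
if ($null -eq $p -or $p.Status -ne 'Running') {
    # primary looks down (or is merely unreachable) - start the standby
    Get-Service -Name $svc -ComputerName $standby | Start-Service
}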
If you manage to start two instances of PostgreSQL on the same data directory, serious data corruption will happen.
Normally there is a postmaster.pid file that prevents that, but a PostgreSQL server process on a different machine that accesses the same file system will happily unlink that after spewing some log messages, thinking it was left behind from a crash.
So you are really walking on thin ice with a solution like that.
One other issue that you didn't think of is the script that is supposed to check whether the server is still running. What if that script fails because, for example, the network connection between the two servers is down, but the server is still up and running happily? Such a “split brain” scenario will cause data corruption with your setup.
Another word of caution: since you seem to be using Windows (Powershell?), you probably envision a CIFS file system when you are talking of shared storage. A Windows “network share” is not a reliable file system — last time I checked, it did not honor _commit.
Creating a reliable failover cluster is harder than you think, and I'd recommend that you check existing solutions before you try to roll your own.
This question is an extension of that question.
Yet again: I'm working under CentOS 6.0 and I have a remote win7 folder, mounted with:
mount -t cifs //PC128/mnt /media/net -o "username=WORKGROUP\user,password=pwd,rw,noexec,soft,uid=user,gid=user"
When the remote folder is not available (e.g. the network cable is pulled out), an attempt to access it locks up an application I'm working on. At first I found that QDir::exists() caused a lock-up of 20-90 seconds (I still can't figure out why the duration varies so much); later I found that any call to the stat() function leads to the application locking up.
I followed the advice given in the topic above and moved the QDir::exists() call (and later the stat() call) to another thread, but this didn't solve the problem. The application still hangs when the connection is suddenly lost. The Qt trace shows that the lock is somewhere in the kernel:
0 __kernel_vsyscall
1 __xstat64@GLIBC_2.1 /lib/libc.so.6
2 QFSFileEnginePrivate::doStat stat.h
I also tried to check whether the remote share is still mounted before accessing the folder itself, but it didn't help. Approaches such as:
mount | grep /media/net
show that the shared folder is still mounted even if there is no active connection to the network.
Checking for folder status differences, such as:
stat -fc%t:%T /media/net/ != stat -fc%t:%T /media/net/..
also hangs for ~20 seconds.
So I have several questions:
Is there any way to change the CIFS timeouts? I tried to find out, but it seems there are no appropriate parameters and no CIFS config file.
How can I check whether the remote folder is still mounted without getting locked up?
How can I check whether the folder exists, likewise without getting locked up?
Your problem, an unreachable network filesystem, is a very well-known trigger of a Linux hung task, which is not at all the same thing as a zombie process (killing the parent PID won't do anything).
A hung task is a task that issued a system call which ran into a problem in the kernel, so that the system call never returns.
The main peculiarity is that the scheduler marks the task as being in the "D" state, which means the program is in an uninterruptible state. This means you can do nothing to stop your program: you can send any signal to the task and it will not respond. Launching hundreds of SIGTERM/SIGKILL does nothing!
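You can see such tasks for yourself, for example (a quick sketch):
# list tasks currently stuck in uninterruptible sleep ("D" state)
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'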
This is the case with my old kernel: when my NFS server crashes, I need to reboot the client to kill the tasks using the filesystem. I compiled that kernel a long time ago (I still have the build tree on my HDD), and during configuration I saw this in lib/Kconfig.debug:
config DETECT_HUNG_TASK
bool "Detect Hung Tasks"
depends on DEBUG_KERNEL
default LOCKUP_DETECTOR
help
Say Y here to enable the kernel to detect "hung tasks",
which are bugs that cause the task to be stuck in
uninterruptible "D" state indefinitiley.
When a hung task is detected, the kernel will print the
current stack trace (which you should report), but the
task will stay in uninterruptible state. If lockdep is
enabled then all held locks will also be reported. This
feature has negligible overhead.
It only offered to detect such tasks, or to panic on detection: I haven't checked whether recent kernels can actually solve the problem (judging from your question, that seems to be the case), but I don't think it was worth enabling.
There is a second problem: normally, detection occurs after 120 seconds, but I also saw a Kconfig option for this:
config DEFAULT_HUNG_TASK_TIMEOUT
int "Default timeout for hung task detection (in seconds)"
depends on DETECT_HUNG_TASK
default 120
help
This option controls the default timeout (in seconds) used
to determine when a task has become non-responsive and should
be considered hung.
It can be adjusted at runtime via the kernel.hung_task_timeout_secs
sysctl or by writing a value to
/proc/sys/kernel/hung_task_timeout_secs.
A timeout of 0 disables the check. The default is two minutes.
Keeping the default should be fine in most cases.
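So on a kernel built with this option, the timeout can be inspected and lowered at runtime, e.g. (quick sketch, needs root):
# current value (120 seconds by default)
cat /proc/sys/kernel/hung_task_timeout_secs
# report hung tasks after 30 seconds instead
sysctl -w kernel.hung_task_timeout_secs=30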
This also works with kernel threads. Example: back a loop device with a file on a FUSE filesystem, then crash the userspace program controlling the FUSE filesystem!
You should get a kernel thread whose name is of the form loopX (X normally corresponds to your loop device number) hanging!
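A rough recipe for that experiment (do it in a throwaway VM; sshfs is just one convenient FUSE filesystem, and the device and file names are placeholders):
# 1. mount any FUSE filesystem, e.g. sshfs
sshfs user@host:/tmp /mnt/fuse
# 2. back a loop device with a file that lives on it
dd if=/dev/zero of=/mnt/fuse/disk.img bs=1M count=100
losetup /dev/loop0 /mnt/fuse/disk.img
# 3. start some I/O on the loop device, then kill the FUSE helper hard
dd if=/dev/loop0 of=/dev/null bs=1M &
kill -9 $(pidof sshfs)
# with luck, the loop0 kernel thread is now stuck in the "D" state
ps -eo pid,stat,comm | grep loop0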
Web links:
https://unix.stackexchange.com/questions/5642/what-if-kill-9-does-not-work (look at the answer written by ultrasawblade)
http://www.linuxquestions.org/questions/linux-general-1/kill-a-hung-task-when-kill-9-doesn't-help-697305/
http://forums-web2.gentoo.org/viewtopic-t-811557-start-0.html
http://comments.gmane.org/gmane.linux.kernel/1189978
http://comments.gmane.org/gmane.linux.kernel.cifs/7674 (This is a case similar to yours)
As for your three questions, you have the answer: this is probably due to what appears to be a well-known bug in the Linux VFS kernel layer! (There are no CIFS timeouts to change.)
After much trial & error I found a solution that persists.
# vim /etc/fstab
//192.168.1.122/myshare /mnt/share cifs username=user,password=password,_netdev 0 0
The _netdev option is important since we are mounting a network device. Clients may hang during the boot process if the system encounters any difficulties with the network.
https://www.redhat.com/sysadmin/samba-windows-linux
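To pick up the new entry without rebooting (assuming the mount point already exists):
mount -a                  # mounts everything listed in /etc/fstab
mount | grep /mnt/share   # confirm the share is mounted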
I have an FC14 32-bit system with a custom-compiled 2.6.35.13 kernel.
When I try to start G-WAN I get a "Segmentation fault". I've made no changes; I just downloaded and unpacked the files from the G-WAN site.
In the log file I have:
"[Wed Dec 26 16:39:04 2012 GMT] Available network interfaces (16)"
which is not true; on the machine I have around 1k interfaces, mostly ppp interfaces.
I think the crash has something to do with detecting interfaces/IP addresses, because in the log, after the above line, I have 16 lines with IPs belonging to the FC14 machine, and after that about 1k lines with "0.0.0.0" or "random" IP addresses.
I ran G-WAN 3.3.7 64-bit on an FC16 machine with about the same number of interfaces and had no problem; well, it still reported a wrong number of interfaces (16), but it did not crash, and in the log file I got only 16 lines, with the IP addresses belonging to the FC16 machine.
Any ideas?
Thanks
I have around 1k interfaces mostly ppp interfaces
Only the first 16 will be listed as this information becomes irrelevant with more interfaces (the intent was to let users find why a listen attempt failed).
The crash is probably due to that long 1k list; many things have changed internally since the allocator was redesigned from scratch. Thank you for reporting the bug.
I also confirm the comment which says that the maintenance script crashes. Thanks for that.
Note that bandwidth shaping will be modified to avoid the newer Linux syscalls, so the GLIBC 2.7 requirement will be waived.
...with a custom compiled kernel
As a general rule, check again on a standard system like Debian 6.x before asking a question: there is room enough for trouble with a known system - there's no need to add custom system components.
Thank you all for the tons(!) of emails received these two last days about the new release!
I had a similar "Segmentation fault" error; mine happens any time I go to 9+GB of RAM. Exact same machine at 8GB works fine, and 10GB doesn't even report an error, it just returns to the prompt.
Interesting behavior... Have you tried adjusting the amount of RAM to see what happens?
(running G-WAN 4.1.25 on Debian 6.x)
I have an SNMP client sending my PC SNMP traps with destination port 162.
I run a sniffer (Wireshark) from my PC, and see that the traps are indeed received.
The SNMP.exe and SNMPTRAP.exe system processes are up and running (I've even restarted them), and SNMPTRAP.exe is listening on port 162. There is no active firewall (neither Windows nor third-party).
The problem: on my PC I have three different applications, all registered to SNMPTRAP.exe. These are all off-the-shelf software, not something I wrote; MG-SOFT Trap Ringer, for example, is one of them. NONE OF THEM CATCHES ANY OF THE TRAPS, and I have no idea where exactly the failure is.
Do you have any idea what may be causing this? Or how, perhaps, I can debug the SNMPTRAP.exe process?
Thanks!
You may disable the Microsoft SNMP Trap service in the Services panel; then MG-SOFT Trap Ringer will work.
The reason is simply that when Microsoft's SnmpTrap.exe monitors port 162, it prevents other applications, such as MG-SOFT's, from monitoring that port.
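From an elevated command prompt, something like this should do it (assuming the service short name is SNMPTRAP, as it is on the Windows versions I have seen):
net stop SNMPTRAP
sc config SNMPTRAP start= disabled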
To eliminate possibilities and corner the problem, try the steps below.
1. How are you sure it's not catching traps (since you see them in Wireshark)? I mean, should they be displayed on some GUI screen? If yes, are the fields empty? Or should they be logged to some file? If yes, are you sure you are checking the correct file, and have you configured (through conf files or command-line options) which file the traps should be logged to?
2. If step 1 does not help, try installing and configuring the whole setup on a different PC/laptop and see whether it works there.
3. If step 2 does not help, try a different trap daemon program (instead of SNMPTRAP.EXE); plenty of open-source programs exist. If you can capture traps with a different daemon, then there is some issue with the SNMPTRAP.EXE you are using (see the sketch after this list).
I'm sure one of these steps will work for you; if not, get back to us :)
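For step 3, a quick way to try an alternative listener is Net-SNMP (assuming it is installed and Microsoft's SNMPTRAP.EXE has been stopped so that port 162 is free):
rem run snmptrapd in the foreground, logging received traps to the console
snmptrapd -f -Lo --disableAuthorization=yes
rem from another window, send yourself a test trap (coldStart)
snmptrap -v 2c -c public 127.0.0.1:162 "" 1.3.6.1.6.3.1.1.5.1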