We have an Orion instance which crashes about once every day or two.
In /var/log/contextBroker/contextBroker-service.out I have found:
log directory: '/var/log/contextBroker'
*** glibc detected *** /usr/bin/contextBroker: corrupted double-linked list: 0x00007f0ed92e3f70 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75f4e)[0x7f0eecdeaf4e]
/lib64/libc.so.6(+0x763d3)[0x7f0eecdeb3d3]
/lib64/libc.so.6(+0x78c88)[0x7f0eecdedc88]
/usr/lib64/libstdc++.so.6(_ZNSsD1Ev+0x39)[0x7f0eed6404c9]
/usr/bin/contextBroker(_Z9jsonParseP14ConnectionInfoPKcRKSsP8JsonNodeP9ParseData+0x539)[0x56fb99]
/usr/bin/contextBroker(_Z9jsonTreatPKcP14ConnectionInfoP9ParseData11RequestTypeRKSsPP11JsonRequest+0x17d)[0x56cf0d]
/usr/bin/contextBroker(_Z12payloadParseP14ConnectionInfoP9ParseDataP11RestServicePP10XmlRequestPP11JsonRequestP18JsonDelayedReleaseRSt6vectorISsSaISsEE+0x3f2)[0x564012]
/usr/bin/contextBroker(_Z11restServiceP14ConnectionInfoP11RestService+0x126c)[0x5654bc]
/usr/bin/contextBroker[0x55cbb6]
/usr/bin/contextBroker[0x55f987]
/usr/lib64/libmicrohttpd.so.10(+0x5599)[0x7f0eee1cf599]
/usr/lib64/libmicrohttpd.so.10(MHD_connection_handle_idle+0x518)[0x7f0eee1d0078]
/usr/lib64/libmicrohttpd.so.10(+0xc3c8)[0x7f0eee1d63c8]
/lib64/libpthread.so.0(+0x7a51)[0x7f0eec957a51]
/lib64/libc.so.6(clone+0x6d)[0x7f0eece5d93d]
And in /var/log/contextBroker/contextBroker-service.out.old the following:
log directory: '/var/log/contextBroker'
*** glibc detected *** /usr/bin/contextBroker: free(): invalid next size (fast): 0x00007fe6d4262110 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75f4e)[0x7fe6e8e9cf4e]
/lib64/libc.so.6(+0x78cf0)[0x7fe6e8e9fcf0]
/usr/bin/contextBroker(_ZN20ContextElementVector7releaseEv+0x2fa)[0x5f4a4a]
/usr/bin/contextBroker(_Z17postUpdateContextP14ConnectionInfoiRSt6vectorISsSaISsEEP9ParseDatab+0x1472)[0x4d6692]
/usr/bin/contextBroker(_Z11restServiceP14ConnectionInfoP11RestService+0x6d6)[0x564926]
/usr/bin/contextBroker[0x55cbb6]
/usr/bin/contextBroker[0x55f987]
/usr/lib64/libmicrohttpd.so.10(+0x5599)[0x7fe6ea281599]
/usr/lib64/libmicrohttpd.so.10(MHD_connection_handle_idle+0x518)[0x7fe6ea282078]
/usr/lib64/libmicrohttpd.so.10(+0xc3c8)[0x7fe6ea2883c8]
/lib64/libpthread.so.0(+0x7a51)[0x7fe6e8a09a51]
/lib64/libc.so.6(clone+0x6d)[0x7fe6e8f0f93d]
Data is sent to Orion in batches every 5 minutes:
a request with around 500 contextElements
a request with around 10 contextElements
a request with a single contextElement
Orion has only 2 subscriptions (AFAIK) which send the data to Proton-CEP.
The Orion version is:
[centos#orion ~]$ /usr/bin/contextBroker --version
0.25.0 (git version: a8cf800d4e9fdd7b4293a886490c40309a5bb58c)
Copyright 2013-2015 Telefonica Investigacion y Desarrollo, S.A.U
Is there anything we can do to debug the issue?
Taking into account the user's inputs, Orion seems to be running below the recommended CPU and RAM thresholds (see recommendations). Thus, with more resources (e.g. 2 vCPUs and 4 GB RAM) it would probably run better, especially if MongoDB runs on the same machine as Orion (MongoDB is known to be a memory-intensive process).
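If you want to confirm how tight the machine actually is before resizing, a quick check with standard Linux tools (the mongod entry assumes MongoDB is co-located on the same host) could look like:
nproc                                # number of vCPUs available
free -m                              # total and used RAM in MB
ps -C mongod -o rss,vsz,cmd          # memory used by MongoDB, if it runs on this machine
ps -C contextBroker -o rss,vsz,cmd   # memory used by Orion itself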
Yesterday we had a crash of PostgreSQL 9.5.14 running on Debian 8 (Linux xxxxxx 3.16.0-7-amd64 #1 SMP Debian 3.16.59-1 (2018-10-03) x86_64 GNU/Linux) - a segmentation fault. The database closed all connections and reinitialized itself, staying ~1 minute in recovery mode.
PostgreSQL log:
2018-10-xx xx:xx:xx UTC [580-2] LOG: server process (PID 16461) was terminated by signal 11: Segmentation fault
kern.log:
Oct xx xx:xx:xx xxxxxxxx kernel: [117977.301353] postgres[16461]: segfault at 7efd3237db90 ip 00007efd3237db90 sp 00007ffd26826678 error 15 in libc-2.19.so[7efd322a2000+1a1000]
According to the libc documentation (https://support.novell.com/docs/Tids/Solutions/10100304.html), error code 15 means:
NX_EDEADLK 15 - resource deadlock would occur - which does not tell me much.
Could you please tell me whether we can do something to avoid this problem in the future? This is of course a production server.
All packages are currently up to date. Upgrading PG is unfortunately not an option. The server runs on Google Compute Engine.
error code 15 means: NX_EDEADLK 15
No, it doesn't mean that. This answer explains how to interpret the 15 here.
It's bits 0, 1, 2 and 3 set (15 = 0b1111) => protection fault, write access, user mode, use of reserved bit. Most likely your postgres process attempted to write through some wild pointer.
if we can do something to avoid this problem in the future?
The only thing you can do is find the bug and fix it, or upgrade to a release of postgres where that bug is already fixed (and hope that no new ones were introduced).
To understand where the bug might be, you should check whether a core dump was produced (if not, do enable them). If you have the core, use gdb /path/to/postgres /path/to/core, and then the where GDB command. That will give you the crash stack trace, which may allow you to find similar bug reports.
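For reference, a rough sketch of that workflow (the binary path and core location are illustrative and depend on your packaging and kernel.core_pattern):
ulimit -c unlimited                   # allow core dumps in the environment the postgres processes run in
cat /proc/sys/kernel/core_pattern     # check where the kernel writes core files
# after the next crash:
gdb /usr/lib/postgresql/9.5/bin/postgres /path/to/core
(gdb) where                           # prints the crash stack trace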
I am creating 4 mountpoint disks in Windows. I need to fill them with files up to a threshold value (say 50 GB).
I tried vdbench. It works fine, but it throws an exception at the end.
compratio=4
dedupratio=1
dedupunit=256k
* Host Definition section
hd=default,user=Administator,shell=vdbench,jvms=1
hd=localhost,system=localhost
********************************************************************************
* Storage Definition section
fsd=fsd1,anchor=C:\UnMapTest-Volume1\disk1\,depth=1,width=1,files=1,size=5g
fsd=fsd2,anchor=C:\UnMapTest-Volume2\disk2\,depth=1,width=1,files=1,size=5g
fwd=fwd1,fsd=fsd*,operation=write,xfersize=1m,fileio=sequential,fileselect=random,threads=10
rd=rd1,fwd=fwd1,fwdrate=max,format=yes,elapsed=1h,interval=1
Below is the exception from vdbench. Because of it, my calling script fails.
05:29:14.287 Message from slave localhost-0:
05:29:14.289 file=C:\UnMapTest-Volume1\disk1\\vdb.1_1.dir\vdb_f0001.file,busy=true
05:29:14.290 Thread: FwgThread write C:\UnMapTest-Volume1\disk1\ rd=rd1 For loops: None
05:29:14.291
05:29:14.292 last_ok_request: Thu Dec 28 05:28:57 PST 2017
05:29:14.292 Duration: 16.92 seconds
05:29:14.293 consecutive_blocks: 10001
05:29:14.294 last_block: FILE_BUSY File busy
05:29:14.294 operation: write
05:29:14.295
05:29:14.296 Do you maybe have more threads running than that you have
05:29:14.296 files and therefore some threads ultimately give up after 10000 tries?
05:29:14.300 *
05:29:14.301 ******************************************************
05:29:14.302 * Slave localhost-0 aborting: Too many thread blocks *
05:29:14.302 ******************************************************
05:29:14.303 *
05:29:21.235
05:29:21.235 Slave localhost-0 prematurely terminated.
05:29:21.235
05:29:21.235 Slave aborted. Abort message received:
05:29:21.235 Too many thread blocks
05:29:21.235
05:29:21.235 Look at file localhost-0.stdout.html for more information.
05:29:21.735
05:29:21.735 Slave localhost-0 prematurely terminated.
05:29:21.735
java.lang.RuntimeException: Slave localhost-0 prematurely terminated.
at Vdb.common.failure(common.java:335)
at Vdb.SlaveStarter.startSlave(SlaveStarter.java:198)
at Vdb.SlaveStarter.run(SlaveStarter.java:47)
I am using PowerShell on a Windows machine. If some other tool such as Diskspd has a way to fill data up to a threshold, please point me to it.
I found the answer myself.
I have done this using Diskspd.exe as below.
The following command fills 50 GB of data in the mentioned disk folder:
.\diskspd.exe -c50G -b4K -t2 C:\UnMapTest-Volume1\disk1\testfile1.dat
It is much simpler than vdbench for my requirement.
Caution: it does not write real data, so the consumed size does not show up on the array-side disks.
I am using Asterisk 11.19.0 on a production server. Asterisk crashes many times a day. I am getting the message below.
*** glibc detected *** /usr/sbin/asterisk: malloc(): memory corruption: 0x00007f2d6c0016f0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3c49075f3e]
/lib64/libc.so.6[0x3c4907a4fa]
/lib64/libc.so.6(__libc_calloc+0xcd)[0x3c4907a8cd]
/usr/sbin/asterisk(ast_frdup+0xc7)[0x4dfce7]
/usr/sbin/asterisk[0x4709cf]
/usr/sbin/asterisk(ast_answer+0x324)[0x488d44]
/usr/lib/asterisk/modules/res_agi.so(+0x939b)[0x7f2f04f9a39b]
/usr/lib/asterisk/modules/res_agi.so(+0xb4c4)[0x7f2f04f9c4c4]
/usr/lib/asterisk/modules/res_agi.so(+0xbc24)[0x7f2f04f9cc24]
/usr/lib/asterisk/modules/res_agi.so(+0xd8d5)[0x7f2f04f9e8d5]
/usr/sbin/asterisk(pbx_exec+0x124)[0x520d74]
/usr/sbin/asterisk[0x52e689]
/usr/sbin/asterisk[0x532ac5]
/usr/sbin/asterisk[0x5342cb]
/usr/sbin/asterisk[0x5765ab]
/lib64/libpthread.so.0[0x3c49407aa1]
/lib64/libc.so.6(clone+0x6d)[0x3c490e8aad]
This info by itself can't help you.
You have to use gdb and a core dump file.
For more info, please consult https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace
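The rough workflow from that page looks like this (a sketch; paths are illustrative, and Asterisk should ideally be built with DONT_OPTIMIZE so the trace is readable):
ulimit -c unlimited                 # allow core dumps
asterisk -g                         # -g makes Asterisk dump core on crash (or set dumpcore = yes in asterisk.conf)
# after the next crash:
gdb /usr/sbin/asterisk /path/to/core
(gdb) bt full                       # full backtrace to attach to a bug report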
We're running the Context Broker on a CentOS server but it keeps crashing on certain update queries. We've tried version 0.26 and the latest 1.0.0-1, but the result is the same; we've also tried switching the MongoDB version between 3.0.6 and 3.0.7, but no luck. The logs don't give us much to go on, so that's why we're asking here on SO.
What we're doing is sending an update of an entity about 1 MB in size, routed in from an HTTP call via nginx. The Context Broker crashes (see logs below) but MongoDB and other services continue to function normally.
Log file: /var/log/contextBroker/contextBroker.log
terminate called after throwing an instance of 'mongo::MsgAssertionException'
what(): EOO Before end of object
Log file: /var/log/messages
Apr 28 07:15:50 gl abrt[11457]: Saved core dump of pid 11426 (/usr/bin/contextBroker) to /var/spool/abrt/ccpp-2016-04-28-07:15:49-11426 (63606784 bytes)
Apr 28 07:15:50 gl abrtd: Directory 'ccpp-2016-04-28-07:15:49-11426' creation detected
Apr 28 07:15:50 gl abrtd: Package 'contextBroker' isn't signed with proper key
Apr 28 07:15:50 gl abrtd: 'post-create' on '/var/spool/abrt/ccpp-2016-04-28-07:15:49-11426' exited with 1
Apr 28 07:15:50 gl abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2016-04-28-07:15:49-11426'
Output from the contextBroker when it's run in verbose mode:
INFO#14:05:27 logMsg.h[1792]: Starting transaction from 127.0.0.1:51245/v1/updateContext
INFO#14:05:27 connectionOperations.cpp[78]: Database Operation Successful (query: { id.id: "8a55c32500dfad.....06be56709b75b31c1f9beb7d2", id.type: "House", _id.servicePath: /^\/$/ })
terminate called after throwing an instance of 'mongo::MsgAssertionException'
what(): BSONElement: bad type 100
Any ideas about what could be causing this, or where we should continue looking?
This crash is due to a bug detected in Orion. A fix is on the way, and we hope it gets merged in time to be included in the next Orion release (Orion 1.2.0).
We have a system with a broken Postgres (a RAID failure is the reason) and no backups.
I'm trying to move the data to another computer with Postgres (and finally make a backup).
But whenever I set up the data directory and run Postgres, I get the message
GET FATAL: database files are incompatible with server
2012-08-15 19:58:38 GET DETAIL: The database cluster was initialized with BLCKSZ 16777216, but the server was compiled with BLCKSZ 8192.
2012-08-15 19:58:38 GET HINT: It looks like you need to recompile or initdb.
16777216 (2 to the power of 24) is a very strange number - far too big.
However, I can't change the default value of 8192 when compiling (playing with --with-blocksize= has no effect, and I can't find BLCKSZ in the header files).
Is there any way to extract the data?
This is the environment and the circumstances:
harddrive: RAID 1 with 3 SAS disks in array
OS: ubuntu 10.04.04 amd64
Postgres: 9.1 (installed via apt-get; we changed the repository links to a newer Ubuntu release)
the system broke - after some time we got
AAC: Host Adapter BLINK LED 0x56
AACO: Adapter kernel panic'd 56
(filesystem or hardware error)
Somehow we recovered the data directory. pg_controldata shows:
pg_control version number: 903
Catalog version number: 201105231
Database system identifier: 5714530593695276911
Database cluster state: shut down
pg_control last modified: Tue 15 Aug 2012 11:50:50
Latest checkpoint location: 1B595668/2000020
Prior checkpoint location: 0/0
Latest checkpoint's REDO location: 1B595668/2000020
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/4057946
Latest checkpoint's NextOID: 40960
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 670
Latest checkpoint's oldestXID's DB: 1344846103
Latest checkpoint's oldestActiveXID: 0
Time of latest checkpoint: Tue 15 Aug 2012 11:50:50
Minimum recovery ending location: 0/0
Backup start location: 0/0
Current wal_level setting: minimal
Current max_connections setting: 100
Current max_prepared_xacts setting:0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 16777216
Blocks per segment of large relation:131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 2387576020
Maximum size of a TOAST chunk: 0
Date/time type storage: floating-point numbers
Float4 argument passing: by reference
Float8 argument passing: by reference
First I tried to bring the DB up on an Ubuntu server (hard disk - a plain SATA 2 drive, Ubuntu 10.04 i386, Postgres 9.1) and got the same exception as above (with BLCKSZ).
That's why I deployed Ubuntu 10.04 amd64 with an English-locale Postgres 9.1 in a virtual machine (because I got '?' instead of Russian characters in the error logs in the previous step).
Got the same exception (with BLCKSZ).
After that I removed the apt-get Postgres version and compiled it from source as described in the docs: http://www.postgresql.org/docs/9.1/static/installation.html.
Playing with configure --with-blocksize=BLOCKSIZE had no effect - I got the same error.
Sorry for the post.
The pg_control file had been broken by some of our manipulations of it.
So, the cluster was successfully restored with pg_resetxlog using the initial data.
A block size of 16 MB would be really weird, and since these two values also look completely bogus:
Maximum columns in an index: 2387576020
Maximum size of a TOAST chunk: 0
...you might want to question the integrity of this data before spending time on compiling Postgres with a non-standard block size.
If you look at the sizes of the files corresponding to relations, are they multiples of 16 MB or of 8 KB?
If the database has some multi-gigabyte tables, what appears to be the cut-off size on disk (the size above which Postgres splits the data into several files)? This should be equal to 'Database block size' * 'Blocks per segment of large relation'. On a default install, it's 1 GB.
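A quick way to check that from a shell, assuming $PGDATA points at the recovered data directory (paths and OIDs below are illustrative):
# list the largest relation segment files; on a default build they are cut at exactly 1 GB (8192 * 131072 bytes)
find "$PGDATA/base" -type f -name '[0-9]*' -printf '%s\t%p\n' | sort -n | tail
# for a given file, print its size modulo 8 KB and modulo 16 MB (the segment path here is a made-up example)
stat -c %s "$PGDATA/base/16384/16385" | awk '{print $1 % 8192, $1 % 16777216}'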
See here for details on configuring kernel resources. Perhaps the default/current settings for this new OS won't allow the postmaster to start.
Here are details on the meaning and context of the BLCKSZ parameter. Was the system that failed running a 64-bit build of PostgreSQL while the new system is a 32-bit build? If possible, obtaining version information from the failed system's PostgreSQL could shed light on the problem. Let us know what version, build, and OS were used. Was it a custom build?