DB2 online backup not working and can't force the application - db2

I've started a full online backup of a DB2 database (the HADR primary) to an NFS share. The database is about 3 TB.
From the beginning it has reported 0% of work completed.
I also can't find the application responsible for the backup:
db2 list applications show detail | grep -i backup
I hit Ctrl+C, but the backup is still listed in db2 list utilities (even after 13 hours it is at 0%).
Because of the backup, the tablespaces are in the 0x0800 state.
How can I cancel this backup without stopping the instance and get the tablespaces back to the 0x0000 state?
I've tried a HADR takeover to make the standby server the primary, but it didn't respond, possibly because of the 0x0800 tablespace states.
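For reference, the standard sequence for identifying and forcing the backup agent would be something like this (a sketch; 39477 is the application handle visible in the db2pd output below); in my case it has not cleared the utility either:
db2 list utilities show detail
db2pd -db DBNAME -apinfo 39477
db2 "force application (39477)"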
Regards
db2pd -uti
Database Member 0 -- Active -- Up 59 days 21:12:12 -- Date 2023-02-04-13.11.23.647422
Utilities:
Address ID Type State Invoker Priority StartTime DBName NumPhases CurPhase Description
0x078000000800EB40 18236 BACKUP 0 0 0 Fri Feb 3 20:05:13 DB_NAME 1 1 online db
Progress:
Address ID PhaseNum CompletedWork TotalWork StartTime Description
0x078000000800F008 18236 1 0 bytes 3203172077568 bytes Fri Feb 3 20:05:13 n/a
Here is the output in db2diag.log from when the backup was started:
2023-02-03-20.05.13.038004+060 E1202639A533 LEVEL: Info
PID : 31982054 TID : 14137 PROC : db2sysc 0
INSTANCE: inst_name NODE : 000 DB : DBNAME
APPHDL : 0-39477 APPID: *LOCAL.inst_name.230203190514
AUTHID : inst_name HOSTNAME: primary
EDUID : 14137 EDUNAME: db2agent (DBNAME) 0
FUNCTION: DB2 UDB, database utilities, sqluxGetDegreeParallelism, probe:558
DATA #1 : <preformatted>
Autonomic backup/restore - using parallelism = 10.
2023-02-03-20.05.13.824937+060 E1203173A560 LEVEL: Info
PID : 31982054 TID : 14137 PROC : db2sysc 0
INSTANCE: inst_name NODE : 000 DB : DBNAME
APPHDL : 0-39477 APPID: *LOCAL.inst_name.230203190514
AUTHID : inst_name HOSTNAME: primary
EDUID : 14137 EDUNAME: db2agent (DBNAME) 0
FUNCTION: DB2 UDB, database utilities, sqluxGetAvailableHeapPages, probe:634
DATA #1 : <preformatted>
Autonomic BAR - heap consumption.
Targetting (50%) - 262144 of 524288 pages.
2023-02-03-20.05.13.825914+060 E1203734A547 LEVEL: Info
PID : 31982054 TID : 14137 PROC : db2sysc 0
INSTANCE: inst_name NODE : 000 DB : DBNAME
APPHDL : 0-39477 APPID: *LOCAL.inst_name.230203190514
AUTHID : inst_name HOSTNAME: primary
EDUID : 14137 EDUNAME: db2agent (DBNAME) 0
FUNCTION: DB2 UDB, database utilities, sqlubTuneBuffers, probe:889
DATA #1 : <preformatted>
Autonomic backup - tuning enabled.
Using buffer size = 4097, number = 20.
2023-02-03-20.05.13.868295+060 E1204282A493 LEVEL: Info
PID : 31982054 TID : 14137 PROC : db2sysc 0
INSTANCE: inst_name NODE : 000 DB : DBNAME
APPHDL : 0-39477 APPID: *LOCAL.inst_name.230203190514
AUTHID : inst_name HOSTNAME: primary
EDUID : 14137 EDUNAME: db2agent (DBNAME) 0
FUNCTION: DB2 UDB, database utilities, sqlubSetupJobControl, probe:2066
MESSAGE : Starting an online db backup.
2023-02-03-23.13.48.244675+060 E1204776A430 LEVEL: Info
PID : 31982054 TID : 12081 PROC : db2sysc 0
INSTANCE: inst_name NODE : 000 DB : DBNAME
HOSTNAME: primary
EDUID : 12081 EDUNAME: db2logmgr (DBNAME) 0
FUNCTION: DB2 UDB, data protection services, sqlpgArchiveLogFile, probe:3108
DATA #1 : <preformatted>
Started archive for log file S0048345.LOG.
db2pd -db DBNAME -apinfo 39477
Database Member 0 -- Database DBNAME -- Active -- Up 60 days 00:21:25 -- Date 2023-02-04-16.20.38.252022
snapapp Time: 2023-02-04-16.20.38.253042
Application :
Address : 0x0A0001001470ECE0
AppHandl [nod-index] : 39477 [000-39477]
TranHdl : 692
Application PID : 11600524
Application Node Name : primary
IP Address: n/a
Connection Start Time : 2023-02-03-20.05.13.035704
Client User ID : instname
System Auth ID : instname
Coordinator EDU ID : 14137
Coordinator Member : 0
Registered Agents : 1
Active Agents : 1
Locks timeout value : NotSet
Locks Escalation : No
Workload ID : 1
Workload Occurrence ID : 1951323
Trusted Context : n/a
Connection Trust Type : non trusted
Role Inherited : n/a
Application Status : RequestInterrupted
Application Name : db2bp
Application ID : *LOCAL.instname.230203190514
ClientUserID : n/a
ClientWrkstnName : n/a
ClientApplName : n/a
ClientAccntng : n/a
CollectActData: N
CollectActPartition: C
SectionActuals: N
UOW start time : 2023-02-03-20.05.13.037338
UOW stop time :
--------------------------------------------------------------------------------
db2pd -db DBNAME -barstats 14137
Printing out Backup Runtime Statistics at 2023-02-04-16.23.20.523322:
Backup Related EDUs:
---------------------------------------------------------------------
Backup agent ID: 14137
MC 0 (EDU ID): db2med.14137.0 (70333)
Table Spaces to be Backed Up (appTblSpace):
---------------------------------------------------------------------
numEntry: 14
Table Spaces:
tblSpaceName: SYSCATSPACE
tblSpaceID: 0
tblSpaceType: 2
tblSpaceDataType: 0
tblSpaceSize: 28420 (in 4K pages)
nContainers: 1
backupInProgressTurnedOn: T
backupActiveIsSet: T
autoResizeLockAcquired: T
extentMovementLockAcquired: T
olbLockAcquired: F
userSpecified: T
tblSpaceTotalPages: 28668
pageSize: 32768
extSize: 4
actualSize: 931266560
scanPages: T
backupLSN: 0000000000000000
tblSpaceName: USERSPACE1
tblSpaceID: 2
tblSpaceType: 2
tblSpaceDataType: 32
tblSpaceSize: 29457664 (in 4K pages)
nContainers: 1
backupInProgressTurnedOn: T
backupActiveIsSet: T
autoResizeLockAcquired: T
extentMovementLockAcquired: T
olbLockAcquired: F
userSpecified: T
tblSpaceTotalPages: 29458016
pageSize: 32768
extSize: 32
actualSize: 965268733952
scanPages: T
backupLSN: 0000000000000000
tblSpaceName: SYSTOOLSPACE
tblSpaceID: 3
tblSpaceType: 2
tblSpaceDataType: 32
tblSpaceSize: 1652 (in 4K pages)
nContainers: 1
backupInProgressTurnedOn: T
backupActiveIsSet: T
autoResizeLockAcquired: T
extentMovementLockAcquired: T
olbLockAcquired: F
userSpecified: T
tblSpaceTotalPages: 2044
pageSize: 32768
extSize: 4
actualSize: 54132736
scanPages: T
backupLSN: 0000000000000000
tblSpaceName: SL
tblSpaceID: 4
tblSpaceType: 2
tblSpaceDataType: 32
tblSpaceSize: 4512 (in 4K pages)
nContainers: 1
backupInProgressTurnedOn: T
backupActiveIsSet: T
autoResizeLockAcquired: T
extentMovementLockAcquired: T
olbLockAcquired: F
userSpecified: T
tblSpaceTotalPages: 4832
pageSize: 32768
extSize: 32
actualSize: 147849216
scanPages: T
backupLSN: 0000000000000000
tblSpaceName: MONITOR
tblSpaceID: 5
tblSpaceType: 2
tblSpaceDataType: 32
tblSpaceSize: 3712 (in 4K pages)
nContainers: 1
backupInProgressTurnedOn: T
backupActiveIsSet: T
autoResizeLockAcquired: T
extentMovementLockAcquired: T
olbLockAcquired: F
userSpecified: T
tblSpaceTotalPages: 4128
pageSize: 32768
extSize: 32
actualSize: 121634816
scanPages: T
backupLSN: 0000000000000000
tblSpaceName: XMLSPACE1
tblSpaceID: 6
tblSpaceType: 2
tblSpaceDataType: 32
tblSpaceSize: 57837504 (in 4K pages)
nContainers: 1
backupInProgressTurnedOn: T
backupActiveIsSet: T
autoResizeLockAcquired: T
extentMovementLockAcquired: T
olbLockAcquired: F
userSpecified: T
tblSpaceTotalPages: 57837600
pageSize: 32768
extSize: 32
actualSize: 1895219331072
scanPages: T
backupLSN: 0000000000000000
tblSpaceName: INDSPACE1
tblSpaceID: 7
tblSpaceType: 2
tblSpaceDataType: 32
tblSpaceSize: 32384 (in 4K pages)
nContainers: 1
backupInProgressTurnedOn: T
backupActiveIsSet: T
autoResizeLockAcquired: T
extentMovementLockAcquired: T
olbLockAcquired: F
userSpecified: T
tblSpaceTotalPages: 32512
pageSize: 32768
extSize: 32
actualSize: 1061158912
scanPages: T
backupLSN: 0000000000000000
tblSpaceName: ARCHSPACE1
tblSpaceID: 8
tblSpaceType: 2
tblSpaceDataType: 32
tblSpaceSize: 9817120 (in 4K pages)
nContainers: 1
backupInProgressTurnedOn: T
backupActiveIsSet: T
autoResizeLockAcquired: T
extentMovementLockAcquired: T
olbLockAcquired: F
userSpecified: T
tblSpaceTotalPages: 9817760
pageSize: 32768
extSize: 32
actualSize: 321687388160
scanPages: T
backupLSN: 0000000000000000
tblSpaceName: DSMSPACE
tblSpaceID: 9
tblSpaceType: 2
tblSpaceDataType: 32
tblSpaceSize: 160 (in 4K pages)
nContainers: 1
backupInProgressTurnedOn: T
backupActiveIsSet: T
autoResizeLockAcquired: T
extentMovementLockAcquired: T
olbLockAcquired: F
userSpecified: T
tblSpaceTotalPages: 160
pageSize: 32768
extSize: 32
actualSize: 5242880
scanPages: T
backupLSN: 0000000000000000
tblSpaceName: DLSPACE
tblSpaceID: 11
tblSpaceType: 2
tblSpaceDataType: 32
tblSpaceSize: 301504 (in 4K pages)
nContainers: 1
backupInProgressTurnedOn: T
backupActiveIsSet: T
autoResizeLockAcquired: T
extentMovementLockAcquired: T
olbLockAcquired: F
userSpecified: T
tblSpaceTotalPages: 302048
pageSize: 32768
extSize: 32
actualSize: 9879683072
scanPages: T
backupLSN: 0000000000000000
tblSpaceName: DLSPACECOPY
tblSpaceID: 12
tblSpaceType: 2
tblSpaceDataType: 32
tblSpaceSize: 96 (in 4K pages)
nContainers: 1
backupInProgressTurnedOn: T
backupActiveIsSet: T
autoResizeLockAcquired: T
extentMovementLockAcquired: T
olbLockAcquired: F
userSpecified: T
tblSpaceTotalPages: 992
pageSize: 32768
extSize: 32
actualSize: 3145728
scanPages: T
backupLSN: 0000000000000000
tblSpaceName: REPORTS
tblSpaceID: 13
tblSpaceType: 2
tblSpaceDataType: 32
tblSpaceSize: 64352 (in 4K pages)
nContainers: 1
backupInProgressTurnedOn: T
backupActiveIsSet: T
autoResizeLockAcquired: T
extentMovementLockAcquired: T
olbLockAcquired: F
userSpecified: T
tblSpaceTotalPages: 64352
pageSize: 32768
extSize: 32
actualSize: 2108686336
scanPages: T
backupLSN: 0000000000000000
tblSpaceName: HISTORY
tblSpaceID: 14
tblSpaceType: 2
tblSpaceDataType: 32
tblSpaceSize: 3680 (in 4K pages)
nContainers: 1
backupInProgressTurnedOn: T
backupActiveIsSet: T
autoResizeLockAcquired: T
extentMovementLockAcquired: T
olbLockAcquired: F
userSpecified: T
tblSpaceTotalPages: 4064
pageSize: 32768
extSize: 32
actualSize: 120586240
scanPages: T
backupLSN: 0000000000000000
tblSpaceName: DLSPACECOPY2
tblSpaceID: 15
tblSpaceType: 2
tblSpaceDataType: 32
tblSpaceSize: 161600 (in 4K pages)
nContainers: 1
backupInProgressTurnedOn: T
backupActiveIsSet: T
autoResizeLockAcquired: T
extentMovementLockAcquired: T
olbLockAcquired: F
userSpecified: T
tblSpaceTotalPages: 161760
pageSize: 32768
extSize: 32
actualSize: 5291114496
scanPages: T
backupLSN: 0000000000000000
Table Space Queue:
[0]
tblSpaceName: XMLSPACE1
tblSpaceID: 6
backupStatus: pending
[1]
tblSpaceName: USERSPACE1
tblSpaceID: 2
backupStatus: pending
[2]
tblSpaceName: ARCHSPACE1
tblSpaceID: 8
backupStatus: pending
[3]
tblSpaceName: DLSPACE
tblSpaceID: 11
backupStatus: pending
[4]
tblSpaceName: DLSPACECOPY2
tblSpaceID: 15
backupStatus: pending
[5]
tblSpaceName: REPORTS
tblSpaceID: 13
backupStatus: pending
[6]
tblSpaceName: INDSPACE1
tblSpaceID: 7
backupStatus: pending
[7]
tblSpaceName: SYSCATSPACE
tblSpaceID: 0
backupStatus: pending
[8]
tblSpaceName: SL
tblSpaceID: 4
backupStatus: pending
[9]
tblSpaceName: MONITOR
tblSpaceID: 5
backupStatus: pending
[10]
tblSpaceName: HISTORY
tblSpaceID: 14
backupStatus: pending
[11]
tblSpaceName: SYSTOOLSPACE
tblSpaceID: 3
backupStatus: pending
[12]
tblSpaceName: DSMSPACE
tblSpaceID: 9
backupStatus: pending
[13]
tblSpaceName: DLSPACECOPY
tblSpaceID: 12
backupStatus: pending
The next table space to be backed up:
[0]
tblSpaceName: XMLSPACE1
tblSpaceID: 6
backupStatus: pending
Performance statistics:
---------------------------------------------------------------------
Parallelism = 10
Number of buffers = 20
Buffer size = 16781312 (4097 4kB pages)
BM# Total I/O MsgQ WaitQ Buffers Bytes
--- -------- -------- -------- -------- -------- --------
000 40811.03 0.00 0.00 40811.00 0 0
001 40811.03 0.00 0.00 40810.96 0 0
002 40811.03 0.00 0.00 40811.00 0 0
003 40811.03 0.00 0.00 40811.00 0 0
004 40811.03 0.00 0.00 40811.00 0 0
005 40811.03 0.00 0.00 40811.00 0 0
006 40811.03 0.00 0.00 40810.98 0 0
007 40811.03 0.00 0.00 40811.00 0 0
008 40811.03 0.00 0.00 40811.00 0 0
009 40811.04 0.00 0.00 40811.00 0 0
--- -------- -------- -------- -------- -------- --------
TOT - - - - 0 0
MC# Total I/O MsgQ WaitQ Buffers Bytes
--- -------- -------- -------- -------- -------- --------
000 73085.95 0.00 0.00 0.00 0 0
--- -------- -------- -------- -------- -------- --------
TOT - - - - 0 0
Size estimates:
---------------------------------------------------------------------
Total size estimate (bytes): 3203172077568
Pre-adjusted total size estimate (bytes): 3203172077568
Init data estimate (bytes): 4645780
User data estimate (bytes): 3201899954176
End data estimate (bytes): 1073735192
Size estimate for MC1 (bytes): 3203172077568
Size estimate for remaining MCs (bytes): 3203155300352
Progress Monitor:
---------------------------------------------------------------------
Phase #: 1 of 1
Total work units (bytes): 3203172077568
Completed work units (bytes): 0
Other Backup Statistics:
---------------------------------------------------------------------
Database bufferpool flushing time: 0.00 seconds
Table space bufferpool flushing time 1: 0.00 seconds
Table space bufferpool flushing time 2: Not Recorded
Table space bufferpool flushing time 3: Not Recorded
Database recovery history file (db2rhist.asc) elapsed processing time: Not Recorded
Table space change history file (db2tschg.his) elapsed processing time: Not Recorded
No logs are included in this image.
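Since every buffer manipulator reports 0 bytes and the media controller shows no activity, checks on the NFS target itself seem relevant (a sketch; /backup/nfs is a placeholder for the actual backup path):
ls -l /backup/nfs
df /backup/nfs
nfsstat -c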

Related

ubuntu postgresql 12 : Out of memory: Killed process (postgres)

I am using Ubuntu 20.04 and PostgreSQL 12.
My machine has 128 GB of memory, a 1 TB SSD, and an i7 CPU (16 cores, 20 threads).
I made a simple C++ program that connects to PostgreSQL and generates a tile map (just map images).
It's similar to osmtilemaker.
Once the program starts, it takes several hours to several months to finish the job.
For the first 4-5 hours, it runs well.
I monitored memory usage, and it never occupies more than 10% of total memory.
Here is the output of the top command:
Tasks: 395 total, 16 running, 379 sleeping, 0 stopped, 0 zombie
%Cpu(s): 70.3 us, 2.9 sy, 0.0 ni, 26.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 128565.5 total, 111124.7 free, 11335.7 used, 6105.1 buff/cache
MiB Swap: 2048.0 total, 2003.6 free, 44.4 used. 115108.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2217811 postgres 20 0 2369540 1.7g 678700 R 99.7 1.3 25:27.36 postgres
2217407 postgres 20 0 2393448 1.7g 678540 R 99.0 1.4 25:32.04 postgres
2217836 postgres 20 0 2352936 1.7g 679348 R 98.0 1.3 25:26.22 postgres
2217715 postgres 20 0 2368268 1.7g 680144 R 97.7 1.3 25:29.78 postgres
2217684 postgres 20 0 2384308 1.7g 679248 R 97.3 1.4 25:29.49 postgres
2217539 postgres 20 0 2386156 1.7g 680124 R 97.0 1.4 25:30.46 postgres
2216651 postgres 20 0 2429348 1.8g 678128 R 95.7 1.4 26:05.99 postgres
2217025 postgres 20 0 2396408 1.7g 679292 R 94.4 1.4 25:51.85 postgres
2238487 postgres 20 0 1294752 83724 54024 R 14.3 0.1 0:00.43 postgres
2238488 postgres 20 0 1294968 219304 189116 R 14.0 0.2 0:00.42 postgres
2238489 postgres 20 0 1294552 85624 56068 R 12.6 0.1 0:00.38 postgres
2062928 j 20 0 861492 536088 47396 S 6.6 0.4 19:18.64 mapTiler
2238490 postgres 20 0 1290132 73244 48084 R 6.3 0.1 0:00.19 postgres
2238491 postgres 20 0 1289876 73064 48160 R 6.3 0.1 0:00.19 postgres
928763 postgres 20 0 1181720 61368 59300 S 0.7 0.0 11:59.45 postgres
1306124 j 20 0 19668 2792 2000 S 0.3 0.0 0:06.84 screen
2238492 postgres 20 0 1273864 49192 40108 R 0.3 0.0 0:00.01 postgres
2238493 postgres 20 0 1273996 50172 40852 R 0.3 0.0 0:00.01 postgres
1 root 20 0 171468 9564 4864 S 0.0 0.0 0:09.40 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
I used 8 threads in the program, so 8 processes are using a lot of CPU,
but memory usage is always below 10%.
However, after 4-5 hours, the oom-killer killed the postgres processes and the program stopped running.
Here is the result of dmesg:
[62585.503398] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),
cpuset=/,mems_allowed=0,global_oom,
task_memcg=/system.slice/system-postgresql.slice/postgresql#12-main.service,
task=postgres,pid=463942, uid=129
[62585.503406] Out of memory: Killed process 463942 (postgres)
total-vm:19010060kB, anon-rss:17369476kB, file-rss:0kB,
shmem-rss:848380kB, UID:129 pgtables:36776kB oom_score_adj:0
It looks like an out-of-memory error.
But how can that happen when I have more than 100 GB of free memory?
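For reference, the settings that usually govern per-backend memory, and the kernel overcommit policy, can be checked like this (a sketch):
psql -c "SHOW work_mem;"
psql -c "SHOW maintenance_work_mem;"
psql -c "SHOW shared_buffers;"
psql -c "SHOW max_connections;"
sysctl vm.overcommit_memory vm.overcommit_ratio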

MongoDB slow - memory usage very high

I have a 5 node MongoDB replica set - 1 primary, 3 secondaries and 1 arbiter.
I am using mongo version 4.2.3.
Sizes:
"dataSize" : 688.4161271536723,
"indexes" : 177,
"indexSize" : 108.41889953613281
My primary is very slow - each command from the shell takes a long time to return.
Memory usage seems very high, and it looks like mongod is consuming more than 50% of the RAM:
# free -lh
total used free shared buff/cache available
Mem: 188Gi 187Gi 473Mi 56Mi 740Mi 868Mi
Low: 188Gi 188Gi 473Mi
High: 0B 0B 0B
Swap: 191Gi 117Gi 74Gi
------------------------------------------------------------------
Top Memory Consuming Process Using ps command
------------------------------------------------------------------
PID PPID %MEM %CPU CMD
311 49145 97.8 498 mongod --config /etc/mongod.conf
23818 23801 0.0 3.8 /bin/prometheus --config.file=/etc/prometheus/prometheus.yml
23162 23145 0.0 8.4 /usr/bin/cadvisor -logtostderr
25796 25793 0.0 0.4 postgres: checkpointer
23501 23484 0.0 1.0 /postgres_exporter
24490 24473 0.0 0.1 grafana-server --homepath=/usr/share/grafana --config=/etc/grafana/grafana.ini --packaging=docker cfg:default.log.mode=console
top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
311 systemd+ 20 0 313.9g 184.6g 2432 S 151.7 97.9 26229:09 mongod
23818 nobody 20 0 11.3g 150084 17988 S 20.7 0.1 8523:47 prometheus
23162 root 20 0 12.7g 93948 5964 S 65.5 0.0 18702:22 cadvisor
serverStatus memory shows this:
octopusrs0:PRIMARY> db.serverStatus().mem
{
"bits" : 64,
"resident" : 189097,
"virtual" : 321404,
"supported" : true
}
octopusrs0:PRIMARY> db.serverStatus().tcmalloc.tcmalloc.formattedString
------------------------------------------------
MALLOC: 218206510816 (208097.9 MiB) Bytes in use by application
MALLOC: + 96926863360 (92436.7 MiB) Bytes in page heap freelist
MALLOC: + 3944588576 ( 3761.9 MiB) Bytes in central cache freelist
MALLOC: + 134144 ( 0.1 MiB) Bytes in transfer cache freelist
MALLOC: + 713330688 ( 680.3 MiB) Bytes in thread cache freelists
MALLOC: + 1200750592 ( 1145.1 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 320992178176 (306122.0 MiB) Actual memory used (physical + swap)
MALLOC: + 13979086848 (13331.5 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 334971265024 (319453.5 MiB) Virtual address space used
MALLOC:
MALLOC: 9420092 Spans in use
MALLOC: 234 Thread heaps in use
MALLOC: 4096 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
How can I determine what is causing this high memory consumption, and what can I do to return to normal memory consumption?
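For reference, the configured WiredTiger cache can be compared against what serverStatus reports like this (a sketch, assuming the default WiredTiger storage engine):
mongo --eval 'db.serverStatus().wiredTiger.cache["maximum bytes configured"]'
mongo --eval 'db.serverStatus().wiredTiger.cache["bytes currently in the cache"]'
grep -A3 wiredTiger /etc/mongod.conf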
Thanks,
Tamar

Osm2pgsql extremely slow on import on server with 192GB RAM

I have a server with great specs: dual 6-core 3.3 GHz CPUs, 196 GB of RAM, and RAID 10 across four 10K SAS drives. I wrote a script that downloads each of the North America extracts and processes them one by one rather than the entire region all at once.
processList.sh:
wget http://download.geofabrik.de/north-america/us/alabama-latest.osm.pbf -O ./geoFiles/north-america/us/alabama-latest.osm.pbf
osm2pgsql -d gis --create --slim -G --hstore --tag-transform-script ~/src/openstreetmap-carto/openstreetmap-carto.lua -C 2000 --number-processes 15 -S ~/src/openstreetmap-carto/openstreetmap-carto.style ./geoFiles/north-america/us/alabama-latest.osm.$
while read in;
do wget http://download.geofabrik.de/$in -O ./geoFiles/$in;
osm2pgsql -d gis --append --slim -G --hstore --tag-transform-script ~/src/openstreetmap-carto/openstreetmap-carto.lua -C 2000 --number-processes 15 -S ~/src/openstreetmap-carto/openstreetmap-carto.style ./geoFiles/$in;
done < maplist.txt
At first it starts out processing at nearly 400K points/second, then slows to 10k or less
osm2pgsql version 0.96.0 (64 bit id space)
Using lua based tag processing pipeline with script /root/src/openstreetmap-carto/openstreetmap-carto.lua
Using projection SRS 3857 (Spherical Mercator)
Setting up table: planet_osm_point
Setting up table: planet_osm_line
Setting up table: planet_osm_polygon
Setting up table: planet_osm_roads
Allocating memory for dense node cache
Allocating dense node cache in one big chunk
Allocating memory for sparse node cache
Sharing dense sparse
Node-cache: cache=2000MB, maxblocks=32000*65536, allocation method=11
Mid: pgsql, cache=2000
Setting up table: planet_osm_nodes
Setting up table: planet_osm_ways
Setting up table: planet_osm_rels
Reading in file: ./geoFiles/north-america/us/alabama-latest.osm.pbf
Using PBF parser.
Processing: Node(5580k 10.7k/s) Way(0k 0.00k/s) Relation(0 0.00/s))
I applied the performance settings from https://wiki.openstreetmap.org/wiki/Osm2pgsql/benchmarks for PostgreSQL:
shared_buffers = 14GB
work_mem = 1GB
maintenance_work_mem = 8GB
effective_io_concurrency = 500
max_worker_processes = 8
max_parallel_workers_per_gather = 2
max_parallel_workers = 8
checkpoint_timeout = 1h
max_wal_size = 5GB
min_wal_size = 1GB
checkpoint_completion_target = 0.9
random_page_cost = 1.1
min_parallel_table_scan_size = 8MB
min_parallel_index_scan_size = 512kB
effective_cache_size = 22GB
Though it starts out well, it quickly deteriorates within about 20 seconds. Any idea why? I looked at top, but it didn't show anything really:
top - 22:48:46 up 3:11, 2 users, load average: 3.49, 4.03, 3.38
Tasks: 298 total, 1 running, 297 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni, 87.5 id, 12.5 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 19808144+total, 19237500+free, 780408 used, 4926040 buff/cache
KiB Swap: 29321212 total, 29321212 free, 0 used. 19437014+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16156 root 20 0 50.819g 75920 8440 S 0.7 0.0 0:02.81 osm2pgsql
16295 root 20 0 42076 4156 3264 R 0.3 0.0 0:00.27 top
1 root 20 0 37972 6024 4004 S 0.0 0.0 0:07.10 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.05 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:00.58 kworker/u64:0
8 root 20 0 0 0 0 S 0.0 0.0 0:01.79 rcu_sched
9 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
10 root rt 0 0 0 0 S 0.0 0.0 0:00.05 migration/0
11 root rt 0 0 0 0 S 0.0 0.0 0:00.03 watchdog/0
The load average is high even though nothing is listed as using much CPU. Here are the results from iotop:
Total DISK READ : 0.00 B/s | Total DISK WRITE : 591.32 K/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 204.69 K/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
28638 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.60 % [kworker/u65:1]
20643 be/4 postgres 0.00 B/s 204.69 K/s 0.00 % 0.10 % postgres: wal writer process
20641 be/4 postgres 0.00 B/s 288.08 K/s 0.00 % 0.00 % postgres: checkpointer process
26923 be/4 postgres 0.00 B/s 98.55 K/s 0.00 % 0.00 % postgres: root gis [local] idle in transaction
1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init
2 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kthreadd]
3 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0]
5 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker/0:0H]
6 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker/u64:0]
8 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_sched]
9 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_bh]
10 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/0]
11 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/0]
12 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/1]
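For reference, a database-side check to go with top and iotop (a sketch; 'gis' is the database name from the script above, and the wait_event columns assume PostgreSQL 9.6 or later):
psql -d gis -c "SELECT pid, state, wait_event_type, wait_event, query FROM pg_stat_activity WHERE datname = 'gis';"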

Why does !heap -s heap not work the way intended?

My WinDBG version is 10.0.10240.9 AMD64, and while casually debugging a native memory dump I realized that my !heap command behaves differently than described, and I am unable to figure out why.
There are plenty of resources mentioning !heap -s:
https://msdn.microsoft.com/en-us/library/windows/hardware/ff563189%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396
http://windbg.info/doc/1-common-cmds.html
When I execute !heap -s, I get this truncated list:
0:000> !heap -s
************************************************************************************************************************
NT HEAP STATS BELOW
************************************************************************************************************************
LFH Key : 0x000000c42ceaf6ca
Termination on corruption : ENABLED
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
-------------------------------------------------------------------------------------
Virtual block: 0000000003d40000 - 0000000003d40000 (size 0000000000000000)
... many more virtual blocks
0000000000b90000 00000002 3237576 3220948 3237576 20007 1749 204 359 0 LFH
0000000000010000 00008000 64 8 64 5 1 1 0 0
... more heaps
-------------------------------------------------------------------------------------
OK, b90000 looks big, but contrary to the docs above and to !heap -s -?, I cannot get information for this single heap. Each of the following commands produces exactly the same output as shown above (as if I had not specified anything after -s):
!heap -s b90000
!heap -s -h b90000
!heap -s 1
I get a load of virtual blocks and a dump of all heaps instead of the single specified one.
Anyone having the same issue?
My "Windows Debugger Version 10.0.10586.567 AMD64" behaved like yours, but
“Microsoft (R) Windows Debugger Version 6.3.9600.16384 AMD64” I have in in:
C:\Program Files\Windows Kits\8.1\Debuggers\x64
0:000> !heap -s -h 0000000000220000
Walking the heap 0000000000220000 ..................Virtual block: 0000000015f20000 - 0000000015f20000 (size 0000000000000000)
Virtual block: 000000001b2e0000 - 000000001b2e0000 (size 0000000000000000)
Virtual block: 000000001f1e0000 - 000000001f1e0000 (size 0000000000000000)
Virtual block: 0000000023c10000 - 0000000023c10000 (size 0000000000000000)
Virtual block: 000000001c060000 - 000000001c060000 (size 0000000000000000)
Virtual block: 000000001ddc0000 - 000000001ddc0000 (size 0000000000000000)
0: Heap 0000000000220000
Flags 00000002 - HEAP_GROWABLE
Reserved memory in segments 226880 (k)
Commited memory in segments 218204 (k)
Virtual bytes (correction for large UCR) 218740 (k)
Free space 12633 (k) (268 blocks)
External fragmentation 5% (268 free blocks)
Virtual address fragmentation 0% (30 uncommited ranges)
Virtual blocks 6 - total 0 KBytes
Lock contention 0
Segments 1
Low fragmentation heap 00000000002291e0
Lock contention 0
Metadata usage 90112 bytes
Statistics:
Segments created 993977
Segments deleted 992639
Segments reused 0
Block cache:
3: 1024 bytes ( 17, 0)
4: 2048 bytes ( 42, 0)
5: 4096 bytes ( 114, 0)
6: 8192 bytes ( 231, 2)
7: 16384 bytes ( 129, 9)
8: 32768 bytes ( 128, 11)
9: 65536 bytes ( 265, 58)
10: 131072 bytes ( 357, 8)
11: 262144 bytes ( 192, 49)
Buckets info:
Size Blocks Seg Empty Aff Distribution
------------------------------------------------
------------------------------------------------
Default heap Front heap Unused bytes
Range (bytes) Busy Free Busy Free Total Average
------------------------------------------------------------------
0 - 1024 577 140 1035286 11608 10563036 10
1024 - 2048 173 3 586 374 27779 36
2048 - 3072 17 19 47 224 1605 25
3072 - 4096 20 12 1 126 348 16
4096 - 5120 35 3 1 30 677 18
5120 - 6144 2 8 0 0 33 16
6144 - 7168 5 9 0 0 56 11
7168 - 8192 0 11 0 0 0 0
8192 - 9216 14 0 0 15 236 16
9216 - 10240 1 0 0 0 8 8
12288 - 13312 1 0 0 0 17 17
14336 - 15360 1 0 0 18 1 1
15360 - 16384 1 0 0 0 32 32
16384 - 17408 10 0 0 0 160 16
22528 - 23552 1 0 0 0 9 9
23552 - 24576 2 0 0 0 32 16
27648 - 28672 1 0 0 0 8 8
30720 - 31744 0 1 0 0 0 0
32768 - 33792 18 0 0 0 250 13
33792 - 34816 0 1 0 0 0 0
39936 - 40960 0 2 0 0 0 0
40960 - 41984 0 1 0 0 0 0
43008 - 44032 0 2 0 0 0 0
44032 - 45056 0 5 0 0 0 0
45056 - 46080 0 1 0 0 0 0
46080 - 47104 0 2 0 0 0 0
47104 - 48128 0 1 0 0 0 0
49152 - 50176 0 3 0 0 0 0
50176 - 51200 1 0 0 0 16 16
51200 - 52224 0 4 0 0 0 0
57344 - 58368 0 1 0 0 0 0
58368 - 59392 0 1 0 0 0 0
62464 - 63488 0 1 0 0 0 0
63488 - 64512 200 1 0 0 3200 16
64512 - 65536 0 1 0 0 0 0
65536 - 66560 1029 2 0 0 10624 10
79872 - 80896 100 0 0 0 900 9
131072 - 132096 9 0 0 0 144 16
193536 - 194560 1 0 0 0 9 9
224256 - 225280 1 0 0 0 16 16
262144 - 263168 49 27 0 0 784 16
327680 - 328704 1 0 0 0 17 17
384000 - 385024 0 1 0 0 0 0
523264 - 524288 1 5 0 0 23 23
------------------------------------------------------------------
Total 2271 268 1035921 12395 10610020 10
This might be a workaround;
I can't answer why the Win 10 version doesn't work :-(
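For reference, pointing that older debugger at the same dump is just (a sketch; the dump path is a placeholder):
"C:\Program Files\Windows Kits\8.1\Debuggers\x64\windbg.exe" -z C:\dumps\mydump.dmp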

Mongodb with very high CPU rate

When I run the following code and kill it immediately (that is, make it exit abnormally), the CPU usage of MongoDB goes extremely high (around 100%):
#-*- encoding:UTF-8 -*-
import threading
import time
import pymongo

# one shared connection (pymongo 2.x API)
single_conn = pymongo.Connection('localhost', 27017)

class SimpleExampleThread(threading.Thread):
    def run(self):
        print single_conn['scrapy'].zhaodll.count(), self.getName()
        time.sleep(20)

# start 100 threads that all share the single connection
for i in range(100):
    SimpleExampleThread().start()
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
VIRT RES SHR S %CPU %MEM TIME+ COMMAND
696m 35m 6404 S 1181.7 0.1 391:45.31 mongod
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
My MongoDB version is 2.2.3. When MongoDB was working well, I ran the command "strace -c -p " for 1 minute, which gave the following output:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
33.50 0.322951 173 1867 nanosleep
33.19 0.319950 730 438 recvfrom
21.16 0.203969 16 12440 select
12.13 0.116983 19497 6 restart_syscall
0.02 0.000170 2 73 write
0.00 0.000016 0 146 sendto
0.00 0.000007 0 73 lseek
0.00 0.000000 0 2 read
0.00 0.000000 0 3 open
0.00 0.000000 0 3 close
0.00 0.000000 0 2 fstat
0.00 0.000000 0 87 mmap
0.00 0.000000 0 2 munmap
0.00 0.000000 0 1 pwrite
0.00 0.000000 0 3 msync
0.00 0.000000 0 29 mincore
0.00 0.000000 0 73 fdatasync
------ ----------- ----------- --------- --------- ----------------
100.00 0.964046 15248 total
When the CPU usage of MongoDB went very high (around 100%), I ran the same command, which gave the following output:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
29.12 5.064230 3088 1640 nanosleep
28.83 5.013239 27851 180 recvfrom
22.72 3.950399 658400 6 restart_syscall
19.30 3.356491 327 10268 select
0.02 0.004026 67 60 sendto
0.01 0.001000 333 3 msync
0.00 0.000269 9 30 write
0.00 0.000125 4 30 fdatasync
0.00 0.000031 10 3 open
0.00 0.000000 0 2 read
0.00 0.000000 0 3 close
0.00 0.000000 0 2 fstat
0.00 0.000000 0 30 lseek
0.00 0.000000 0 57 mmap
0.00 0.000000 0 2 munmap
0.00 0.000000 0 1 pwrite
0.00 0.000000 0 14 mincore
------ ----------- ----------- --------- --------- ----------------
100.00 17.389810 12331 total
And if I run the command "lsof", there are many sockets with the description "can't identify protocol". I don't know what is going wrong. Is this a bug in MongoDB?
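For reference, a way to count those sockets (a sketch; uses pgrep to find the mongod PID):
lsof -p "$(pgrep -d, mongod)" | grep -c "can't identify protocol"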
Thanks!