Liferay 7, Hikari-Pool connection not available error, happen on production but not on pre-production environment - postgresql

First of all, i have two environments strictly identically configured (exept IP) with two vm each. One in pre-production and one in production (currently in configuration phase). There is one vm with a liferay 7.0.6 tomcat bundle (build from 7.0.6-cumulative patch of the community-security-team) and an other with postgresql 9.4.26.
Everything works fine on pre-production environment.
On production environment, a few hours after beginning to create users in liferay i ran into this error (full stack at the end) :
Caused by: java.sql.SQLTransientConnectionException: HikariPool-2 - Connection is not available, request timed out after 937980ms.
at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:591)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:194)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:146)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:112)
at org.springframework.jdbc.datasource.LazyConnectionDataSourceProxy$LazyConnectionInvocationHandler.getTargetConnection(LazyConnectionDataSourceProxy.java:403)
at org.springframework.jdbc.datasource.LazyConnectionDataSourceProxy$LazyConnectionInvocationHandler.invoke(LazyConnectionDataSourceProxy.java:376)
at com.sun.proxy.$Proxy7.prepareStatement(Unknown Source)
at org.hibernate.jdbc.AbstractBatcher.getPreparedStatement(AbstractBatcher.java:534)
at org.hibernate.jdbc.AbstractBatcher.getPreparedStatement(AbstractBatcher.java:452)
at org.hibernate.jdbc.AbstractBatcher.prepareQueryStatement(AbstractBatcher.java:161)
at org.hibernate.loader.Loader.prepareQueryStatement(Loader.java:1700)
at org.hibernate.loader.Loader.doQuery(Loader.java:801)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:274)
at org.hibernate.loader.Loader.loadEntity(Loader.java:2037)
... 50 more
So i checked if there is differences between my liferay configuration on pre-production and the one in production using comparison software, and exept IP i found nothing. Idem with postgresql configurations on both environment.
I also checked time synchronization between vm and they are both synchronize via ntp on debian pool server.
Database vm :
n$ ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
0.debian.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.002
1.debian.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.002
2.debian.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.002
3.debian.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.002
+mail.klausen.dk 193.67.79.202 2 u 116 1024 377 14.435 -0.214 0.358
+any.time.nl 85.199.214.99 2 u 871 1024 377 1.666 -0.183 0.258
-rag.9t4.net 131.188.3.221 2 u 102 1024 377 16.491 -3.769 0.571
*ntp1.m-online.n 212.18.1.106 2 u 318 1024 377 16.608 -0.263 0.240
-tor-relais1.lin 131.188.3.223 2 u 199 1024 377 14.149 0.272 0.661
-www.kashra.com .DCFp. 1 u 150 1024 377 22.623 1.126 0.816
and Liferay vm :
$ ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
0.debian.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.002
1.debian.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.002
2.debian.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.002
3.debian.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.002
+stratum2-4.ntp. 129.70.137.82 2 u 200 1024 377 30.471 1.069 2.414
+138.201.16.225 131.188.3.221 2 u 696 1024 377 16.613 1.357 0.397
-kuehlich.com 131.188.3.221 2 u 373 1024 377 22.656 2.025 0.885
-time.cloudflare 10.21.8.19 3 u 566 1024 377 8.123 -0.640 0.277
*a.chl.la 131.188.3.222 2 u 167 1024 377 14.472 1.033 2.448
+195.50.171.101 145.253.3.52 2 u 266 1024 377 10.804 0.928 0.395
I also notice an error line in postgresql log about a request of process cancelation on unknown PID happening exactly the amout (937980ms) of milliseconds before the timeout error in Liferay:
LOG: PID 1767 in the cancel request does not match any process
I have tried re-installing Liferay from scratch but nothing change.
It should exist a difference between pre-production and production because it works fine on pre-production but i can't find it.
HikariCP configuration in liferay is default on both environment
jdbc.default.connectionTimeout=30000
jdbc.default.driverClassName=org.postgresql.Driver
jdbc.default.idleConnectionTestPeriod=60
jdbc.default.idleTimeout=600000
jdbc.default.initialPoolSize=10
jdbc.default.liferay.pool.provider=hikaricp
jdbc.default.maxActive=100
jdbc.default.maxIdleTime=3600
jdbc.default.maxLifetime=0
jdbc.default.maxPoolSize=100
jdbc.default.maximumPoolSize=100
jdbc.default.minIdle=10
jdbc.default.minPoolSize=10
jdbc.default.minimumIdle=10
, and same for postgresql :
max_connections = 100
HikariPool full satck from liferay :
2021-06-30 00:12:32.397 ERROR [liferay/scheduler_dispatch-6][JDBCExceptionReporter:234] HikariPool-2 - Connection is not available, request timed out after 937980ms.
2021-06-30 00:12:32.401 ERROR [liferay/scheduler_dispatch-6][BasePersistenceImpl:264] Caught unexpected exception
com.liferay.portal.kernel.exception.SystemException: com.liferay.portal.kernel.dao.orm.ORMException: org.hibernate.exception.GenericJDBCException: could not load an entity: [com.liferay.counter.model.impl.CounterImpl#com.liferay.counter.kernel.model.Counter]
at com.liferay.portal.kernel.service.persistence.impl.BasePersistenceImpl.processException(BasePersistenceImpl.java:270)
at com.liferay.counter.service.persistence.impl.CounterFinderImpl._obtainIncrement(CounterFinderImpl.java:391)
at com.liferay.counter.service.persistence.impl.CounterFinderImpl._competeIncrement(CounterFinderImpl.java:339)
at com.liferay.counter.service.persistence.impl.CounterFinderImpl._competeIncrement(CounterFinderImpl.java:325)
at com.liferay.counter.service.persistence.impl.CounterFinderImpl.increment(CounterFinderImpl.java:111)
at com.liferay.counter.service.persistence.impl.CounterFinderImpl.increment(CounterFinderImpl.java:100)
at com.liferay.counter.service.persistence.impl.CounterFinderImpl.increment(CounterFinderImpl.java:95)
at com.liferay.counter.service.impl.CounterLocalServiceImpl.increment(CounterLocalServiceImpl.java:42)
at sun.reflect.GeneratedMethodAccessor638.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.liferay.portal.spring.aop.ServiceBeanMethodInvocation.proceed(ServiceBeanMethodInvocation.java:163)
at com.liferay.portal.spring.transaction.CounterTransactionExecutor.execute(CounterTransactionExecutor.java:50)
at com.liferay.portal.spring.transaction.TransactionInterceptor.invoke(TransactionInterceptor.java:58)
at com.liferay.portal.spring.aop.ServiceBeanMethodInvocation.proceed(ServiceBeanMethodInvocation.java:137)
at com.liferay.portal.spring.aop.ServiceBeanAopProxy.invoke(ServiceBeanAopProxy.java:169)
at com.sun.proxy.$Proxy78.increment(Unknown Source)
at com.liferay.counter.kernel.service.CounterLocalServiceUtil.increment(CounterLocalServiceUtil.java:238)
at com.liferay.portal.kernel.systemevent.SystemEventHierarchyEntryThreadLocal.push(SystemEventHierarchyEntryThreadLocal.java:134)
at com.liferay.portal.kernel.systemevent.SystemEventHierarchyEntryThreadLocal.push(SystemEventHierarchyEntryThreadLocal.java:96)
at com.liferay.portal.repository.capabilities.TemporaryFileEntriesCapabilityImpl._runWithoutSystemEvents(TemporaryFileEntriesCapabilityImpl.java:313)
at com.liferay.portal.repository.capabilities.TemporaryFileEntriesCapabilityImpl.deleteExpiredTemporaryFileEntries(TemporaryFileEntriesCapabilityImpl.java:113)
at com.liferay.document.library.web.internal.messaging.TempFileEntriesMessageListener.deleteExpiredTemporaryFileEntries(TempFileEntriesMessageListener.java:111)
at com.liferay.document.library.web.internal.messaging.TempFileEntriesMessageListener$1.performAction(TempFileEntriesMessageListener.java:134)
at com.liferay.document.library.web.internal.messaging.TempFileEntriesMessageListener$1.performAction(TempFileEntriesMessageListener.java:130)
at com.liferay.portal.kernel.dao.orm.DefaultActionableDynamicQuery.performAction(DefaultActionableDynamicQuery.java:405)
at com.liferay.portal.kernel.dao.orm.DefaultActionableDynamicQuery$1.call(DefaultActionableDynamicQuery.java:315)
at com.liferay.portal.kernel.dao.orm.DefaultActionableDynamicQuery$1.call(DefaultActionableDynamicQuery.java:277)
at com.liferay.portal.kernel.dao.orm.DefaultActionableDynamicQuery.doPerformActions(DefaultActionableDynamicQuery.java:335)
at com.liferay.portal.kernel.dao.orm.DefaultActionableDynamicQuery.performActions(DefaultActionableDynamicQuery.java:86)
at com.liferay.document.library.web.internal.messaging.TempFileEntriesMessageListener.doReceive(TempFileEntriesMessageListener.java:139)
at com.liferay.portal.kernel.messaging.BaseMessageListener.receive(BaseMessageListener.java:26)
at com.liferay.portal.kernel.scheduler.messaging.SchedulerEventMessageListenerWrapper.receive(SchedulerEventMessageListenerWrapper.java:66)
at com.liferay.portal.kernel.messaging.InvokerMessageListener.receive(InvokerMessageListener.java:74)
at com.liferay.portal.kernel.messaging.ParallelDestination$1.run(ParallelDestination.java:52)
at com.liferay.portal.kernel.concurrent.ThreadPoolExecutor$WorkerTask._runTask(ThreadPoolExecutor.java:756)
at com.liferay.portal.kernel.concurrent.ThreadPoolExecutor$WorkerTask.run(ThreadPoolExecutor.java:667)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.liferay.portal.kernel.dao.orm.ORMException: org.hibernate.exception.GenericJDBCException: could not load an entity: [com.liferay.counter.model.impl.CounterImpl#com.liferay.counter.kernel.model.Counter]
at com.liferay.portal.dao.orm.hibernate.ExceptionTranslator.translate(ExceptionTranslator.java:34)
at com.liferay.portal.dao.orm.hibernate.SessionImpl.get(SessionImpl.java:205)
at com.liferay.portal.kernel.dao.orm.ClassLoaderSession.get(ClassLoaderSession.java:326)
at com.liferay.counter.service.persistence.impl.CounterFinderImpl._obtainIncrement(CounterFinderImpl.java:369)
... 36 more
Caused by: org.hibernate.exception.GenericJDBCException: could not load an entity: [com.liferay.counter.model.impl.CounterImpl#com.liferay.counter.kernel.model.Counter]
at org.hibernate.exception.SQLStateConverter.handledNonSpecificException(SQLStateConverter.java:140)
at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:128)
at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:66)
at org.hibernate.loader.Loader.loadEntity(Loader.java:2041)
at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:86)
at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:76)
at org.hibernate.persister.entity.AbstractEntityPersister.load(AbstractEntityPersister.java:3294)
at org.hibernate.event.def.DefaultLoadEventListener.loadFromDatasource(DefaultLoadEventListener.java:496)
at org.hibernate.event.def.DefaultLoadEventListener.doLoad(DefaultLoadEventListener.java:477)
at org.hibernate.event.def.DefaultLoadEventListener.load(DefaultLoadEventListener.java:227)
at org.hibernate.event.def.DefaultLoadEventListener.lockAndLoad(DefaultLoadEventListener.java:403)
at org.hibernate.event.def.DefaultLoadEventListener.onLoad(DefaultLoadEventListener.java:155)
at org.hibernate.impl.SessionImpl.fireLoad(SessionImpl.java:1090)
at org.hibernate.impl.SessionImpl.get(SessionImpl.java:1075)
at org.hibernate.impl.SessionImpl.get(SessionImpl.java:1066)
at com.liferay.portal.dao.orm.hibernate.SessionImpl.get(SessionImpl.java:201)
... 38 more
Caused by: java.sql.SQLTransientConnectionException: HikariPool-2 - Connection is not available, request timed out after 937980ms.
at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:591)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:194)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:146)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:112)
at org.springframework.jdbc.datasource.LazyConnectionDataSourceProxy$LazyConnectionInvocationHandler.getTargetConnection(LazyConnectionDataSourceProxy.java:403)
at org.springframework.jdbc.datasource.LazyConnectionDataSourceProxy$LazyConnectionInvocationHandler.invoke(LazyConnectionDataSourceProxy.java:376)
at com.sun.proxy.$Proxy7.prepareStatement(Unknown Source)
at org.hibernate.jdbc.AbstractBatcher.getPreparedStatement(AbstractBatcher.java:534)
at org.hibernate.jdbc.AbstractBatcher.getPreparedStatement(AbstractBatcher.java:452)
at org.hibernate.jdbc.AbstractBatcher.prepareQueryStatement(AbstractBatcher.java:161)
at org.hibernate.loader.Loader.prepareQueryStatement(Loader.java:1700)
at org.hibernate.loader.Loader.doQuery(Loader.java:801)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:274)
at org.hibernate.loader.Loader.loadEntity(Loader.java:2037)
... 50 more
I'm actually out of other idea to test, any help would be thankfull.

Finally we found the solution.
On pre-production environment the two vms are on the same VLAN, which was not the case on production.
Solution: putting the vms on the same VLAN solve the problem.

Related

ubuntu postgresql 12 : Out of memory: Killed process (postgres)

I am using unbuntu 20.04 and postgresql 12 .
My memory is 128 GB , SDD is 1TB. Cpu is i7 (16 core 20 threads)
I made simple c++ program which connect postgresql and generting tile map (just map image).
It's similar with osmtilemaker.
Once program starts, it takes serveral hours to servral months to finish job.
For first 4~5 hours, it runs well.
I monitored memory usage, and It doesnot occupy more than 10% of whole Memory at most.
Here is the screenshot of top command
Tasks: 395 total, 16 running, 379 sleeping, 0 stopped, 0 zombie
%Cpu(s): 70.3 us, 2.9 sy, 0.0 ni, 26.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 128565.5 total, 111124.7 free, 11335.7 used, 6105.1 buff/cache
MiB Swap: 2048.0 total, 2003.6 free, 44.4 used. 115108.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2217811 postgres 20 0 2369540 1.7g 678700 R 99.7 1.3 25:27.36 postgres
2217407 postgres 20 0 2393448 1.7g 678540 R 99.0 1.4 25:32.04 postgres
2217836 postgres 20 0 2352936 1.7g 679348 R 98.0 1.3 25:26.22 postgres
2217715 postgres 20 0 2368268 1.7g 680144 R 97.7 1.3 25:29.78 postgres
2217684 postgres 20 0 2384308 1.7g 679248 R 97.3 1.4 25:29.49 postgres
2217539 postgres 20 0 2386156 1.7g 680124 R 97.0 1.4 25:30.46 postgres
2216651 postgres 20 0 2429348 1.8g 678128 R 95.7 1.4 26:05.99 postgres
2217025 postgres 20 0 2396408 1.7g 679292 R 94.4 1.4 25:51.85 postgres
2238487 postgres 20 0 1294752 83724 54024 R 14.3 0.1 0:00.43 postgres
2238488 postgres 20 0 1294968 219304 189116 R 14.0 0.2 0:00.42 postgres
2238489 postgres 20 0 1294552 85624 56068 R 12.6 0.1 0:00.38 postgres
2062928 j 20 0 861492 536088 47396 S 6.6 0.4 19:18.64 mapTiler
2238490 postgres 20 0 1290132 73244 48084 R 6.3 0.1 0:00.19 postgres
2238491 postgres 20 0 1289876 73064 48160 R 6.3 0.1 0:00.19 postgres
928763 postgres 20 0 1181720 61368 59300 S 0.7 0.0 11:59.45 postgres
1306124 j 20 0 19668 2792 2000 S 0.3 0.0 0:06.84 screen
2238492 postgres 20 0 1273864 49192 40108 R 0.3 0.0 0:00.01 postgres
2238493 postgres 20 0 1273996 50172 40852 R 0.3 0.0 0:00.01 postgres
1 root 20 0 171468 9564 4864 S 0.0 0.0 0:09.40 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
I used 8 threads in program, so 8 process is using cpu a lot ,
but memory usage is always below 10%.
But, after 4~5 hours, oom-killer killed postgres processes and program stopped running.
Here is the result of dmesg
[62585.503398] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),
cpuset=/,mems_allowed=0,global_oom,
task_memcg=/system.slice/system-postgresql.slice/postgresql#12-main.service,
task=postgres,pid=463942, uid=129
[62585.503406] Out of memory: Killed process 463942 (postgres)
total-vm:19010060kB, anon-rss:17369476kB, file-rss:0kB,
shmem-rss:848380kB, UID:129 pgtables:36776kB oom_score_adj:0
It looks like out of memory error.
But, how that can happend when I have free memory more than 100GB?

Understanding get_misses in memaslap

I am trying to understand the behaviour of memcached by running memaslap against it.
The command I use is
memaslap -s localhost:11212 -T 64 -c 64 --verbose -S 1s -t 30s
and this is the output I get (on local machine)
Get Statistics (3157196 events)
Min: 36
Max: 17957
Avg: 545
Geo: 533.99
Std: 167.05
Log2 Dist:
4: 0 0 11 271
8: 2813 1379137 1706845 66407
12: 1110 446 152 4
Set Statistics (350829 events)
Min: 60
Max: 16184
Avg: 547
Geo: 536.33
Std: 161.65
Log2 Dist:
4: 0 0 3 28
8: 313 151210 191252 7861
12: 112 40 10
Total Statistics (3508025 events)
Min: 36
Max: 17957
Avg: 545
Geo: 534.22
Std: 167.17
Log2 Dist:
4: 0 0 14 299
8: 3126 1530347 1898097 74268
12: 1222 486 162 4
cmd_get: 3157252
cmd_set: 350837
get_misses: 1716802
written_bytes: 608669765
read_bytes: 1610227982
object_bytes: 381710656
Run time: 30.0s Ops: 3508089 TPS: 116914 Net_rate: 70.5M/s
As you can see I have get_misses which amount to 1716802, which is more than 50% of the get requests sent by memaslap. I am curious about what a get_miss means. I checked the docs at
http://docs.libmemcached.org/bin/memaslap.html
and plainly says
get_misses
How many objects can’t be gotten from server
What does it mean to not get a key from server? Is it because the key is not present or is it because of latencies?
If its because the key is not present, what exactly is memaslap testing by sending this request?
It says how many times we've got cache miss while running the get command, in other words - how many times key was not found.
The main reason why there are cache misses is eviction. Memcached throws some items out of memory to free space for the new ones.
Number of get_misses should drop if memcached would have more RAM to use (for instance: -m 1G for 1 Gb of RAM)

20 seconts difference betwen ntp-synced servers

I've got several CentOS 6 servers, synced to pool.ntp.org time-servers.
But sometimes time on them is out of sync, which make difference for 20-30 seconds, which causes errors in my app.
What can be the cause of this, and where should I look for it?
Config
tinker panic 1000 allan 1500 dispersion 15 step 0.128 stepout 900
statsdir /var/log/ntpstats/
leapfile /etc/ntp.leapseconds
driftfile /var/lib/ntp/ntp.drift
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable
disable monitor
server 0.pool.ntp.org iburst minpoll 6 maxpoll 10
restrict 0.pool.ntp.org nomodify notrap noquery
server 1.pool.ntp.org iburst minpoll 6 maxpoll 10
restrict 1.pool.ntp.org nomodify notrap noquery
server 2.pool.ntp.org iburst minpoll 6 maxpoll 10
restrict 2.pool.ntp.org nomodify notrap noquery
server 3.pool.ntp.org iburst minpoll 6 maxpoll 10
restrict 3.pool.ntp.org nomodify notrap noquery
restrict default kod notrap nomodify nopeer noquery
restrict 127.0.0.1 nomodify
restrict -6 default kod notrap nomodify nopeer noquery
restrict -6 ::1 nomodify
server 127.127.1.0 # local clock
fudge 127.127.1.0 stratum 10
srv1
remote refid st t when poll reach delay offset jitter
==============================================================================
server01.coloce .STEP. 16 u 11d 1024 0 0.000 0.000 0.000
+mt4.raqxs.net 193.190.230.66 2 u 510 1024 377 6.367 5.984 7.433
+16-164-ftth.ons 193.79.237.14 2 u 217 1024 375 11.339 -0.028 4.564
*services.freshd 213.136.0.252 2 u 419 1024 377 6.735 2.048 4.321
LOCAL(0) .LOCL. 10 l - 64 0 0.000 0.000 0.000
srv2
remote refid st t when poll reach delay offset jitter
==============================================================================
+ntp2.edutel.nl 80.94.65.10 2 u 527 1024 377 11.924 1.469 0.753
-95.211.224.12 193.67.79.202 2 u 364 1024 377 12.989 4.930 0.628
+app.kingsquare. 193.79.237.14 2 u 339 1024 377 5.485 0.493 0.591
*ntp.bserved.nl 193.67.79.202 2 u 206 1024 377 7.007 0.539 0.420
LOCAL(0) .LOCL. 10 l - 64 0 0.000 0.000 0.000

Exporting to .csv format from matlab or log file

I am running a simulation which displays a list of parameters at every time step and displays it on the command window. It also generates a separate log file which shows all the parameters at each time step.Is there a way where i can export into a .csv file in a structured manner.There is no structure in which the output is displayed.Every row has different number of parameters.
Part of output looks like this
Time = 80.0000 sec =======================================
rmc_gr 22 DTT_pos_in: 5 Veh_Spd(mph): 1.9 fhtyi_st_pr_des: 3 ttgyda_pc_dd: 0.000
rmip_pc_dot_dd: -0.000 rmssh__thld: 0.000 rmssh_tm_tin_exit: 0.000 rmssh_gr_des: 1
tin_ne: 676 tdc_t_bar: 602 tdc_nbar: 407 tdc_n_isb_bar: 0
tfp_nerq: 7000 tcrpmbar: 408 tpm__bar: 89
r56sc_gr_t: 22 sacr_d_ele: 0 saprc_p_lnp: 180
saseq_gr: 22 sautl_gr_end: 0 sautl_gr_strt: 0 sautl_gr_end: 22
sautl_grstrt: 0 sa_slp: 0 satq_tqhld: 0.000000 sacorst: 0

uwsgi long timeouts

I am using ubuntu 12, nginx, uwsgi 1.9 with socket, django 1.5.
Config:
[uwsgi]
base_path = /home/someuser/web/
module = server.manage_uwsgi
uid = www-data
gid = www-data
virtualenv = /home/someuser
master = true
vacuum = true
harakiri = 20
harakiri-verbose = true
log-x-forwarded-for = true
profiler = true
no-orphans = true
max-requests = 10000
cpu-affinity = 1
workers = 4
reload-on-as = 512
listen = 3000
Client tests from Windows7:
C:\Users\user>C:\AppServ\Apache2.2\bin\ab.exe -c 255 -n 5000 http://www.someweb.com/about/
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/
Benchmarking www.someweb.com (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Finished 5000 requests
Server Software: nginx
Server Hostname: www.someweb.com
Server Port: 80
Document Path: /about/
Document Length: 1881 bytes
Concurrency Level: 255
Time taken for tests: 66.669814 seconds
Complete requests: 5000
Failed requests: 1
(Connect: 1, Length: 0, Exceptions: 0)
Write errors: 0
Total transferred: 10285000 bytes
HTML transferred: 9405000 bytes
Requests per second: 75.00 [#/sec] (mean)
Time per request: 3400.161 [ms] (mean)
Time per request: 13.334 [ms] (mean, across all concurrent requests)
Transfer rate: 150.64 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 8 207.8 1 9007
Processing: 10 3380 11480.5 440 54421
Waiting: 6 1060 3396.5 271 48424
Total: 11 3389 11498.5 441 54423
Percentage of the requests served within a certain time (ms)
50% 441
66% 466
75% 499
80% 519
90% 3415
95% 36440
98% 54407
99% 54413
100% 54423 (longest request)
I have set following options too:
echo 3000 > /proc/sys/net/core/netdev_max_backlog
echo 3000 > /proc/sys/net/core/somaxconn
So,
1) I make first 3000 requests super fast. I see progress in ab and in uwsgi requests logs -
[pid: 5056|app: 0|req: 518/4997] 80.114.157.139 () {30 vars in 378 bytes} [Thu Mar 21 12:37:31 2013] GET /about/ => generated 1881 bytes in 4 msecs (HTTP/1.0 200) 3 headers in 105 bytes (1 switches on core 0)
[pid: 5052|app: 0|req: 512/4998] 80.114.157.139 () {30 vars in 378 bytes} [Thu Mar 21 12:37:31 2013] GET /about/ => generated 1881 bytes in 4 msecs (HTTP/1.0 200) 3 headers in 105 bytes (1 switches on core 0)
[pid: 5054|app: 0|req: 353/4999] 80.114.157.139 () {30 vars in 378 bytes} [Thu Mar 21 12:37:31 2013] GET /about/ => generated 1881 bytes in 4 msecs (HTTP/1.0 200) 3 headers in 105 bytes (1 switches on core 0)
I dont have any broken pipes or worker respawns.
2) Next requests are running very slow or with some timeout. Looks like that some buffer becomes full and I am waiting before it becomes empty.
3) Some buffer becomes empty.
4) ~500 requests are processed super fast.
5) Some timeout.
6) see Nr. 4
7) see Nr. 5
8) see Nr. 4
9) see Nr. 5
....
....
Need your help
check with netstat and dmesg. You have probably exhausted ephemeral ports or filled the conntrack table.