I've been doing some performance testing on Orion Context Broker, sending many POST requests to update an attribute inside an entity and then checking that the subscribers receive the notification correctly. The environment that I built is shown in the image below.
To deploy all the components on my server I'm using docker, and they communicate with each other over an internal network (docker network). A sensor is simulated with a script that sends data to Orion. If I want to send 800 POST requests per minute, I send a request every 0.075 s (60/800).
With the default configuration of Orion (using the docker image from https://hub.docker.com/r/fiware/orion/), when the number of requests is around 800, Orion fails and I receive an "Empty reply from server". I understand that this happens because the default notification mode is transient, which is not suitable for high-load scenarios.
I've tried changing the default parameters by building another Orion image using threadpool mode and testing different numbers for n (number of workers) and q (queue limit), for example n=8 and q=80, and I changed reqPoolSize to 8 (I am working on a machine with 8 cores and 8 GB RAM). But now, with around 500 alerts, Orion hangs for a few minutes and then works well. I checked the other components in the environment and they seem to be working right. Is there another parameter that I should be looking at, or should I keep testing more values for the ones that I'm using?
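For reference, the rebuilt image basically just overrides the broker's startup flags, something like this (the exact MongoDB host and network names depend on my docker setup, and the flag values are the ones I'm currently testing):

docker run -d --name orion --network fiware-net fiware/orion \
    -dbhost mongo \
    -notificationMode threadpool:80:8 \
    -reqPoolSize 8

Here threadpool:q:n sets the notification queue length (q=80) and the number of notification worker threads (n=8), and -reqPoolSize sets the number of threads serving incoming requests.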
Related
We had a use case where our client by mistake sent 10000 requests; for each request we had to immediately return an ID, then enrich the data with multiple DB calls/REST calls, and finally respond back to a Kafka topic. Due to this much processing the whole system went down, including the underlying systems, since along with Kafka we also publish to MQ, which goes on for further processing.
The ask is to control the number of requests a client can send. We thought of storing a threshold in the DB (per day or per hour) and starting to reject requests once a client reaches the threshold, but this requires computation and DB hits.
Is there any tool or out-of-the-box solution that requires minimum effort and doesn't add performance load to the system? We are looking for some kind of backpressure technique like in Spring WebFlux.
It is a Spring Boot application on Java 11.
Backpressure works the other way around: it applies when you are consuming from a producer that emits more than you can process. What you are looking for is a rate limiter.
You can take a look at the Resilience4J Rate Limiter.
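Here is a minimal sketch of what that could look like with plain Resilience4j (the limits, the limiter name "clientRequests" and the process() method are placeholders, not anything from your system):

import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.ratelimiter.RateLimiterConfig;
import io.github.resilience4j.ratelimiter.RateLimiterRegistry;
import io.github.resilience4j.ratelimiter.RequestNotPermitted;

import java.time.Duration;
import java.util.function.Supplier;

public class ClientRateLimiting {

    public static void main(String[] args) {
        // Allow at most 100 calls per second; a caller waits up to 25 ms for a free permit.
        RateLimiterConfig config = RateLimiterConfig.custom()
                .limitRefreshPeriod(Duration.ofSeconds(1))
                .limitForPeriod(100)
                .timeoutDuration(Duration.ofMillis(25))
                .build();

        RateLimiter limiter = RateLimiterRegistry.of(config).rateLimiter("clientRequests");

        // Wrap the expensive work (DB calls, REST enrichment, Kafka publish) with the limiter.
        Supplier<String> guarded = RateLimiter.decorateSupplier(limiter, () -> process("request-1"));

        try {
            System.out.println(guarded.get());
        } catch (RequestNotPermitted e) {
            // Over the limit: fail fast (e.g. HTTP 429) instead of letting load pile up downstream.
            System.out.println("429 Too Many Requests");
        }
    }

    private static String process(String requestId) {
        return "processed " + requestId; // placeholder for the real enrichment logic
    }
}

If you are on Spring Boot, the resilience4j-spring-boot2 starter lets you do the same thing declaratively by annotating the service method with @RateLimiter(name = "clientRequests") and putting the limits in application.yml.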
We have a production system, an ASP.NET Web API (classic, not .NET Core) application published to Azure. Data storage is Azure SQL Database and we use Entity Framework to access the data. The API has a medium load, 10-60 requests per second, and upper_90 latency is 100-200 ms, which is the target latency in our case. Some time ago we noticed that approximately every 20-30 minutes our service stalls and latency jumps to approximately 5-10 seconds. All requests become slow for about a minute and then the system recovers by itself. At the same time no requests are dropped, they all just take longer to execute for a short period of time (usually about 1 minute).
We see the following picture in our HTTP request telemetry (Azure):
We can also see a correlation with our Azure SQL Database metrics, such as DTU (drop) and connections (increase):
We've analyzed the server and didn't see any correlation with the host's CPU/memory usage (we have just one host); it's stable at a 20-30% CPU usage level and 50% memory usage.
We also have an alternative source of telemetry which shows the same behavior. Our telemetry measures API latency and database metrics such as active connection count and pooled connection count (ADO.NET Connection Pool):
What is interesting is that every system stall is accompanied by a rise in the pooled connection count. And our tests show that the more connections are pooled, the longer you spend waiting for a new connection from that pool to execute your next database operation. We analyzed a few suggestions but were unable to prove or disprove any of them:
ADO.NET connection leak (all our db access happens in a using statement with proper connection disposal/return to pool)
Socket/port exhaustion - we were unable to properly track telemetry on that metric
CPU/memory bottleneck - charts show there is none
DTU (database units) bottleneck - charts show there is none
As of now we are trying to identify the possible culprit of this behavior. Unfortunately, we cannot identify the changes which led to it because of missing telemetry, so now the only way to deal with the issue is to properly diagnose it. And, of course, we can only reproduce it in production, under permanent load (even when the load is not high, like 10 requests a second).
What are the possible causes for this behavior and what is the proper way to diagnose and troubleshoot it?
There can be several possible reasons:
The problem could be in your application code. Create a staging environment and re-run your test with profiler telemetry (e.g. using the YourKit .NET Profiler) - this will allow you to detect the heaviest methods, largest objects, slowest DB queries, etc. Also do a load test on your API with JMeter.
I would recommend trying the Kudu Process API to look at the list of currently running processes and get more info about them, like their CPU time.
The articles on how to monitor CPU usage in Azure App Service are listed below:
https://azure.microsoft.com/en-in/documentation/articles/web-sites-monitor/
https://azure.microsoft.com/en-in/documentation/articles/app-insights-web-monitor-performance/
We ended up separating a few web apps that were hosted in a single App Service Plan. Even though the metrics were not showing any CPU bottleneck for our app, there are other apps which cause CPU usage spikes and, as a result, connection pool queue growth with huge latency spikes.
When we checked the App Service Plan usage and compared it to the database plan usage, it became clear that the bottleneck was in the App Service Plan. It's still hard to explain why a CPU bottleneck causes uneven latency spikes, but we decided to move the most loaded web app to a separate plan and deal with it in isolation. After the separation the app behaves normally, with no CPU or latency spikes, and it looks very stable (same picture as between spikes):
We will continue to analyze the other apps and will eventually find the culprit, but at this point the mission-critical web app is isolated and very stable. The lesson here is to monitor not only the Web App's resource usage but also the hosting App Service Plan, which could have other apps consuming resources (CPU, memory).
I have used the Sails framework for my web application. Now we are calling its services from a mobile app.
I found that Sails executes only 4 requests at a time. Is there any way I can increase that?
Are you sure about that?
I've tested my sails app multiple times with: https://github.com/alexfernandez/loadtest
For example: loadtest http://localhost:1337 --rps 150 -c 20 -k
You can point it at any route of your app and send GET/POST/PUT/DEL requests. The -c option means concurrency, so you can play with that number and check your app's limits, but 5 concurrent is too low.
Most likely your browser or HTTP client is limiting the number of requests per server. Refer to https://stackoverflow.com/a/985704/401025 or look up the maximum number of requests in the manual of your HTTP client.
There are no TCP connection limits imposed by Sails or by Node.js. As a matter of fact, Node.js is known to be able to handle thousands of simultaneous connections.
Without knowing your server/Node.js setup it is hard to say why you are seeing a limit of 4 connections.
Depending on which Node.js version you use, you might check the net and http modules of Node.js and the following attributes:
https://nodejs.org/api/net.html#net_server_maxconnections
Set this property to reject connections when the server's connection count gets high.
https://nodejs.org/api/http.html#http_agent_maxsockets
By default set to Infinity. Determines how many concurrent sockets the agent can have open per origin. Origin is either a 'host:port' or 'host:port:localAddress' combination.
I'm new to Eureka and I see this information on the home page of my Eureka server (localhost:8761/). I didn't find any explanation in the official docs about 'Renews' and 'Renews threshold'. Could anyone please explain these terms? Thanks!
Hope it helps:
Renews: total number of heartbeats the server received from clients
Renews threshold: the minimum number of renewals per minute that the server expects to receive; it also drives the "self-preservation mode" of Eureka. If "Renews" falls below "Renews threshold", self-preservation mode is switched on.
self-preservation mode:
When the Eureka server comes up, it tries to get all of the instance registry information from a neighboring node. If there is a problem getting the information from a node, the server tries all of the peers before it gives up. If the server is able to successfully get all of the instances, it sets the renewal threshold that it should be receiving based on that information. If at any time the renewals fall below the percentage configured for that value (below 85% within 15 minutes), the server stops expiring instances to protect the current instance registry information.
In Netflix, the above safeguard is called self-preservation mode and is primarily used as a protection in scenarios where there is a network partition between a group of clients and the Eureka server. In these scenarios, the server tries to protect the information it already has. In the case of a mass outage this may cause clients to get instances that do not exist anymore. The clients must make sure they are resilient to the Eureka server returning an instance that is non-existent or unresponsive. The best protection in these scenarios is to time out quickly and try other servers.
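If you want to see or tune the knobs behind this behavior, these are the relevant Spring Cloud Netflix properties (a sketch; the values shown are the usual defaults, and your setup may differ):

# turn self-preservation on/off (on by default)
eureka.server.enable-self-preservation=true
# the 85% used to compute the "Renews threshold"
eureka.server.renewal-percent-threshold=0.85
# each client heartbeats every 30 s by default, i.e. 2 renews per minute
eureka.instance.lease-renewal-interval-in-seconds=30

Roughly, the Renews threshold is the number of registered clients times 2 (heartbeats per minute with the default 30-second interval) times 0.85, so with 10 clients that is about 17 expected heartbeats per minute; the exact formula varies a bit between Eureka/Spring Cloud versions.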
For more details, please refer to the Eureka wiki.
It seems that any new send or receive ports that I create do not display any tracking even if I tick all the tracking boxes. I have an existing application and the receive port and orchestration tracking work, but the send port tracking doesn't.
On the same machine I also tried creating a new application, with a send and a receive port, and got no tracking at all. I did the same thing on a fresh install of BizTalk on another machine and I got tracking, so I'm not crazy.
I've tried ...
ticking every box in tracking for the receive, orch, send ports.
creating a new host specifically for tracking
recreating the original host with a different name
the SQL service is running
rebooting the system
rebooting the host instances
restarting the BizTalk services
nothing shows in the event logs
all SQL jobs are OK except for 'Monitor BizTalk', which complains about 7 orphaned DTA instances
I can't see anything in particular that stands out in MBV (MsgBoxViewer) except for the above-mentioned orphaned DTA instances.
In addition to Mike's answer:
You need to ensure that at least one of your hosts is enabled for tracking. In BizTalk Administrator, under Platform Settings > Hosts, select the host and enable tracking (the list of hosts also shows which host(s) currently have tracking enabled).
You can also verify that the tracking SQL Agent job is running by looking directly at the database:
select count(*) from BizTalkMsgBoxDb.dbo.Spool (NOLOCK)
select count(*) from BizTalkDTADb.dbo.Tracking_Parts1 (NOLOCK)
Basically, spool should be a fairly low number (< 10 000), and should come back to a static level after a spike in messages, unless your suspended orchs are growing.
And new messages should be copied across from the MessageBox to DtaDb.TrackingParts every minute, so Tracking_Parts1 should grow by a few records every 60-120 seconds after processing new messages, although they will eventually be purged/archived in line with your tracking archiving/purging strategy.
In a Dev environment, the more tracking the merrier, as HAT (the orchestration debugger) will give you more information the more you track. However, in a PROD environment you would typically want to minimize tracking to improve performance and reduce disk overhead. We just track one copy of each message, viz. 'before processing' on the receive ports and 'after processing' on the send ports to our partners, and nothing at all on internal ports and orchestrations. This allows us to provide sufficient evidence of data received and sent.
This post might help some people: http://learningcenter2.eworldtree.net:7090/Lists/Posts/Post.aspx?ID=78
For message tracking to work, among other factors, make sure that the "Message send and receive events" checkbox in the corresponding pipeline is enabled.
Please take a look at these two articles, What is Message Tracking? and Insight into BizTalk Server message tracking. The first article has an item of interest for you and I'll quote it below and the second should just solidify what you're trying to do.
The SQL Server Agent service must be running on all MessageBox databases. The TrackedMessages_Copy_ job makes message bodies available to tracking queries and WMI. To efficiently copy the message bodies, they remain in the MessageBox database and are periodically copied to the BizTalk Tracking (BizTalkDTADb) database by the TrackedMessages_Copy_ job. Having the SQL Server Agent service running is also a prerequisite for the archiving and purging process to work correctly.
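If you want to confirm that the copy job exists and has been running, you can ask the SQL Server Agent directly (the job name below assumes the default MessageBox database name, BizTalkMsgBoxDb):

EXEC msdb.dbo.sp_help_job @job_name = N'TrackedMessages_Copy_BizTalkMsgBoxDb';
EXEC msdb.dbo.sp_help_jobhistory @job_name = N'TrackedMessages_Copy_BizTalkMsgBoxDb';

The first call shows whether the job is enabled and its last run status, and the second shows its recent execution history.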
Are you using a default pipeline? Have you checked the tracking checkboxes on it? There is a bug where pipeline tracking gets disabled for the default pipelines.
More info here:
http://blog.ibiz-solutions.se/integration/biztalk-global-pipeline-tracking-disabled-unexpectedly/
Please ensure that the required tracking is enabled in the properties of the send pipeline used by your send port. If message body tracking is disabled on the send pipeline, nothing is tracked on the send port either.