Node-RED app in Bluemix crashes during performance testing - ibm-cloud

I have a Node-RED app in Bluemix that contains two flows.
The first flow has 3 nodes: an HTTP In node, a function node that reformats one JSON object into another, and an MQ Light output node.
The second flow has an MQ Light input node, a batcher node that batches so many messages together, a couple of nodes to reformat, and then an HTTP Request node that puts the message into a Cloudant database.
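As an illustration of the batch-then-write step, a batcher that folds N messages into a single Cloudant/CouchDB `_bulk_docs` request body might look like the Python sketch below. The message shape is hypothetical; `_bulk_docs` is the standard CouchDB bulk-write endpoint, and sending one POST per batch instead of one per message is exactly what the batcher node buys you.

```python
import json

def to_bulk_docs(messages):
    """Fold a batch of messages into one _bulk_docs payload so the
    database sees a single POST per batch instead of one per message."""
    return {"docs": [{"payload": m} for m in messages]}

batch = [{"reading": i} for i in range(3)]
# POST this body to https://<account>.cloudant.com/<db>/_bulk_docs
body = json.dumps(to_bulk_docs(batch))
```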
I have been trying to performance test this. I feed it 1,000-5,000 messages over a few minutes, and it crashes before all the messages are written to the database. The error just says exit status 255: CRASHED, and I do not see any additional detail in the logs.
Any help would be appreciated.
See attached screen prints.
Memory usage: 353MB/1.625GB
Disk usage: 333MB/1GB
CPU: 0.3%
UPDATE (4/4/2016): added the error from the crash.

Related

ADF Dataflow stuck in progress and fails with the errors below

ADF pipeline Data Flow task is stuck in progress. It was working seamlessly for the last couple of months, but suddenly the Dataflow gets stuck in progress and times out after a certain time. We are using a managed Virtual Network IR. I am using a ForEach loop to run the data flow for multiple entities in parallel, and it always gets stuck randomly on the last entity.
What can I try to resolve this?
Error in Dev Environment
Error Code 4508: Spark cluster not found
Error in Prod Environment:
Error code: 5000
Failure type: User configuration issue
Details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
I tried the steps below:
Changing the IR configuration
Trying the Data Flow retry and retry interval settings
Also tried running the ForEach loop one batch at a time instead of 4 batches in parallel. None of the above troubleshooting steps worked. These pipelines had been running for the last 3-4 months without a single failure; suddenly they started failing consistently 3 days ago. The data flow always gets stuck in progress randomly for different entities and eventually times out with the errors above.
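The retry-with-interval approach tried above can be sketched generically in Python. Everything here is an assumption for illustration: `trigger_fn` is a hypothetical placeholder for whatever starts the data flow run (and raises on failure), not an actual ADF SDK call; in the real pipeline the equivalent knobs are the activity's Retry and Retry interval settings.

```python
import time

def run_with_retry(trigger_fn, attempts=3, delay_seconds=30):
    """Retry a data flow trigger a few times before giving up.

    trigger_fn is a placeholder callable that starts the run and
    raises an exception when the run fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return trigger_fn()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the failure
            time.sleep(delay_seconds)
```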
Error Code 4508: Spark cluster not found.
This error can occur for two reasons:
The debug session is closed before the data flow finishes its transformation; in this case the recommendation is to restart the debug session.
The second reason is a resource problem, or an outage in that particular region.
Error code 5000, Failure type: User configuration issue, Details: [plugins.*** ADF.adf-ir-001 WorkspaceType: CCID:] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
This is typically a transient error ("Livy job state dead caused by unknown error"). The data flow runs on a Spark cluster in the backend, and this error is generated by that cluster. To get more information about the error, check the StdOut of the Spark pool execution.
The backend cluster may be experiencing a network problem, a resource problem, or an outage.
If the error persists, my suggestion is to raise a Microsoft support ticket here.

Possible Stuckness: Google Cloud PubSub to Cloud Storage

I have a Dataflow streaming job that writes PubSub messages to a file that gets stored in Cloud Storage in 3-minute windows. After a few hours I notice on the "Data Freshness by stages" graph it displays "Possible Stuckness" and "Possible slowness".
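For context, 3-minute fixed windows like the ones described above partition event time into aligned intervals. The Python sketch below shows only the window arithmetic, not the Beam API the job would actually use; the timestamps are made up:

```python
WINDOW_SECONDS = 180  # the 3-minute windows described above

def window_bounds(event_ts_seconds):
    """Return the (start, end) epoch seconds of the fixed window
    that contains the given event timestamp."""
    start = event_ts_seconds - (event_ts_seconds % WINDOW_SECONDS)
    return start, start + WINDOW_SECONDS
```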
I have checked the logs, and the info logs display the following: "Setting socket default timeout to 60 seconds."; "socket default timeout is 60.0 seconds."; "Attempting refresh to obtain initial access_token."; "Refreshing due to a 401 (attempt 1/2)". That last log kept repeating every few minutes for four hours before the job reported possible slowness/stuckness.
I am not entirely sure what is happening here. Are these logs related to why the job slowed down and got stuck?
The "potential stuckness" and "potential slowness" warnings are basically the same thing; they are documented here.
The logs might be red herrings.
You can view all available logs by category (job-message, worker, worker-startup, etc.) as described here. Try to:
identify whether there are any worker logs, to determine whether the workers started successfully with their dependencies installed;
search for "Operation ongoing" to see whether any work item is taking too much time;
search for any error in the workers that is blocking the streaming job from making progress.

Why would running a container on GCE get stuck on "Metadata request unsuccessful" (403 Forbidden)?

I'm trying to run a container in a custom VM on Google Compute Engine. This is for a heavy ETL process, so I need a large machine, but only for a couple of hours a month. I have two versions of my container with small startup changes. Both versions were built and pushed to the same Google Container Registry by the same computer using the same Google login. The older one works fine, but the newer one fails by getting stuck in an endless stream of the following error:
E0927 09:10:13 7f5be3fff700 api_server.cc:184 Metadata request unsuccessful: Server responded with 'Forbidden' (403): Transport endpoint is not connected
Can anyone tell me exactly what's going on here? Can anyone please explain why one of my images doesn't have this problem (well, it gives a few of these messages but gets past them) and the other does (thousands of this message, and it ran for over 24 hours before I killed it)?
If I ssh in to a GCE instance then both versions of the container pull and run just fine. I'm suspecting the INTEGRITY_RULE checking from the logs but I know nothing about how that works.
MORE INFO: this is down to "Restart policy: never". Even a simple centos:7 container that prints "hello world" deployed from the console triggers this if the restart policy is Never. At least in the short term I can fix this in the entrypoint script, as the instance will be destroyed when the monitor realises the process has finished.
I suggest you try creating a 3rd container that's focused on the metadata service functionality to isolate the issue. It may be that there's a timing difference between the 2 containers that's not being overcome.
Make sure you can curl the metadata service from the VM, and that the request to the metadata service uses the VM's service account.
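That metadata check can be done from inside the VM over plain HTTP; the only hard requirement is the `Metadata-Flavor: Google` header, which the server rejects requests without. A minimal Python sketch (the actual call is left commented out because it only resolves on a GCE VM):

```python
import urllib.request

METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/email"
)

def metadata_request(url=METADATA_URL):
    """Build a request for the GCE metadata server; the
    Metadata-Flavor header is mandatory."""
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})

req = metadata_request()
# On a GCE VM this prints the service account the VM is using:
# print(urllib.request.urlopen(req, timeout=5).read().decode())
```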

The target server failed to respond for multiple iterations in JMeter

In my JMeter script, I am getting an error on the 2nd iteration.
For multiple users with a single iteration, no errors were observed, but with multiple iterations I get errors with the message below:
Response code: Non HTTP response code: org.apache.http.NoHttpResponseException
Response message: Non HTTP response message: The target server failed to respond
Response data is The target server failed to respond
Could you please suggest what could be the reason behind this error?
Thanks in advance
Most likely your server becomes overloaded. Regarding the possible reason: my expectation is that a single iteration does not deliver the full concurrency, as JMeter acts like this:
JMeter starts all the virtual users within the specified ramp-up period
Each virtual user starts executing samplers
When there are no more samplers to execute and no loops to iterate - the thread is being shut down
So with 1 iteration you may run into a situation where some threads have already finished their job and others have not started yet. When you add more iterations, the "old" threads start over while "new" ones are still arriving. The situation is explained in the JMeter Test Results: Why the Actual Users Number is Lower than Expected article, and you can monitor the actual delivered load using the Active Threads Over Time chart of the HTML Reporting Dashboard or the Active Threads Over Time listener available via JMeter Plugins.
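The effect described above can be illustrated with a small Python simulation (the numbers are invented for illustration, not taken from the question): with a 10-second ramp-up and short single iterations, early threads finish before late ones start, so peak concurrency never reaches the configured user count; more iterations keep all threads alive simultaneously.

```python
def peak_concurrency(users, ramp_up, iteration_seconds, iterations):
    """Compute the maximum number of simultaneously active threads.

    Thread i starts at an even offset within the ramp-up period and
    runs for iterations * iteration_seconds before shutting down.
    """
    starts = [i * ramp_up / users for i in range(users)]
    ends = [s + iterations * iteration_seconds for s in starts]
    # sweep over start/end events; at equal timestamps process ends first
    events = sorted([(t, 1) for t in starts] + [(t, -1) for t in ends],
                    key=lambda e: (e[0], e[1]))
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak
```

With 10 users, a 10 s ramp-up, and 2 s iterations, a single iteration peaks at only 2 concurrent threads, while 10 iterations reach the full 10.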
To get to the bottom of the failure I would recommend checking the following:
component logs on the application under test side (application logs, application/web server logs, database logs)
application under test baseline health metrics (CPU, RAM, disk, etc.). You can use the JMeter PerfMon Plugin; this way you will be able to correlate increasing load with resource consumption

Partition is in quorum loss

I have a Service Fabric application that has a stateless web API and a stateful service with two partitions. The stateless web API defines a Web API controller and uses ServiceProxy.Create to get a remoting proxy for the stateful service. The remoting call puts a message into a reliable queue.
The stateful service dequeues the messages from the queue every X minutes.
I am looking at the Service Fabric explorer and my application has been in an error state for the past few days. When I drill down into the details the stateful service has the following error:
Error event: SourceId='System.FM', Property='State'. Partition is in
quorum loss.
Looking at the explorer, I see that my primary replica is up and running with what seems to be a single ActiveSecondary, but the other two replicas show IdleSecondary and keep cycling through Standby / InBuild states. I cannot figure out why this is happening.
What are some of the reasons my other secondaries keep failing to get to an ActiveSecondary state / causing this quorum loss?
Try resetting the cluster.
I was facing the same issue with 1 partition for my service.
The error was fixed by resetting the cluster.
Have you checked the Windows Event Log on the nodes for additional error messages?
I had a similar problem, except I was using a ReliableDictionary. Did you properly implement IEquatable<T> and IComparable<T>? I had a similar problem because my T had a dictionary field, and I was calling Equals on the dictionary directly instead of comparing the keys and values. Same thing for GetHashCode.
The clue in the event logs was this message: Assert=Cannot update an item that does not exist (null). It only happened when I edited a key in the ReliableDictionary.
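The invariant being described, deriving both equality and hash from the dictionary's contents rather than its identity, can be sketched in Python (the type and field names below are hypothetical, not from the question):

```python
class QueueKey:
    """A key type with a dictionary field; equality and hash must both
    be derived from the dictionary's keys and values."""

    def __init__(self, name, attributes):
        self.name = name
        self.attributes = attributes  # a plain dict

    def __eq__(self, other):
        if not isinstance(other, QueueKey):
            return NotImplemented
        # compare contents (keys and values), never object identity
        return self.name == other.name and self.attributes == other.attributes

    def __hash__(self):
        # dicts are unhashable, so fold the items into a frozenset;
        # this keeps __hash__ consistent with __eq__
        return hash((self.name, frozenset(self.attributes.items())))
```

In the C# case the same rule applies to Equals and GetHashCode: compare and hash the dictionary's entries, not the dictionary reference.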