Possible Stuckness: Google Cloud PubSub to Cloud Storage

I have a Dataflow streaming job that writes PubSub messages to a file that gets stored in Cloud Storage in 3-minute windows. After a few hours, the "Data Freshness by stages" graph displays "Possible Stuckness" and "Possible Slowness".
I have checked the logs, and the info logs display the following: "Setting socket default timeout to 60 seconds."; "socket default timeout is 60.0 seconds."; "Attempting refresh to obtain initial access_token."; "Refreshing due to a 401 (attempt 1/2)". That last log kept repeating every few minutes for four hours before the job reported possible slowness/stuckness.
I am not entirely sure what is happening here. Are these logs related to why the job slowed down and got stuck?

The "potential stuckness" and "potential slowness" are basically the same thing, they are documented here.
The logs might be red herrings.
You can view all available logs by their categories (job-message, worker, worker-startup, etc.), as described here. Try to:
identify whether there are any worker logs, to determine whether the workers started successfully with their dependencies installed;
search for "Operation ongoing" to see whether any work item is taking too much time (see the sketch after this list);
search for any error in the workers that is blocking the streaming job from making progress.
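For example, one way to pull the "Operation ongoing" messages out of the worker logs is with the google-cloud-logging client library. This is a minimal sketch, assuming that library is installed; the project ID and job ID are placeholders you would need to replace:

```python
# Minimal sketch: search Dataflow worker logs for "Operation ongoing" messages.
# Assumes the google-cloud-logging client library; IDs below are placeholders.
from google.cloud import logging

client = logging.Client(project="YOUR_PROJECT_ID")  # placeholder project ID

# Cloud Logging filter scoped to one Dataflow job's step logs.
log_filter = (
    'resource.type="dataflow_step" '
    'resource.labels.job_id="YOUR_JOB_ID" '  # placeholder job ID
    'textPayload:"Operation ongoing"'
)

for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.payload)
```

A work item that logs "Operation ongoing" for the same step over a long stretch is a good candidate for the stuck stage.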

Related

ADF Dataflow stuck in progress and fails with the errors below

ADF Pipeline DF task is stuck in progress. It had been working seamlessly for the last couple of months, but suddenly the Dataflow gets stuck in progress and times out after a certain time. We are using an IR with managed Virtual Network. I am using a ForEach loop to run the data flow for multiple entities in parallel, and it always gets stuck randomly on the last entity.
What can I try to resolve this?
Error in Dev Environment:
Error code: 4508
Spark cluster not found
Error in Prod Environment:
Error code: 5000
Failure type: User configuration issue
Details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
I tried the steps below:
changing the IR configuration as shown below;
DF retry and retry interval;
running the ForEach loop one batch at a time instead of 4 batches in parallel.
None of the above troubleshooting steps worked. These pipelines had been running for the last 3-4 months without a single failure; suddenly they started to fail consistently 3 days ago. The data flow always gets stuck in progress randomly, for different entities and at different times, and eventually times out, throwing the errors above.
Error Code 4508: Spark cluster not found.
This error can occur for two reasons:
The debug session is closed before the data flow finishes its transformation; in this case the recommendation is to restart the debug session.
The second is a resource problem or an outage in that particular region.
Error code 5000, Failure type: User configuration issue, Details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
"Livy job state dead caused by unknown error" is a transient error. A Spark cluster is used at the back end of the data flow, and this error is generated by that Spark cluster. To get more information about the error, go to the StdOut of the Spark pool execution.
The backend cluster may be experiencing a network problem, a resource problem, or an outage.
If the error persists, my suggestion is to raise a Microsoft support ticket here.

How to check Redshift maintenance windows

As usual, we set our Redshift maintenance window for Saturday morning, and we got several errors during that maintenance window.
* Query Processing Error at 5:07:01 AM
[Amazon](500051) ERROR processing query/statement. Error: Query execution failed
[SQL State=HY000, DB Errorcode=500051]
* Connection Error at 5:07:27.79 AM
[Amazon](500150) Error setting/closing connection: Connection refused: connect.
I guess that's due to Redshift internal maintenance. How can I check for evidence of this on Redshift? I checked svl_qlog with aborted=1, but couldn't find a conclusive record.
Also, is there any way to have the maintenance window skipped while a user session is running?
--
Thanks to the useful information from Schepo and Bill, we were able to prove that the connection error was due to a reboot during the Redshift maintenance window.
We also checked the Redshift events in the console to see exactly what time the Redshift reboot started and ended.
Probably the best way to check whether the connection errors were due to Redshift maintenance is to check the Maintenance tab in your cluster configuration. In the example screenshot below, it's some time between 06:30 and 07:00 am every Wednesday.
There's no way to stop maintenance from happening while user sessions are connected, although you do have the option of deferring all maintenance for up to 45 days if you need to (follow the Edit button on the same screen, or see the sketch below).
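If you prefer to set the deferral programmatically, this is a minimal sketch assuming boto3's Redshift client; the cluster identifier and region are placeholders:

```python
# Minimal sketch: defer Redshift maintenance via the ModifyClusterMaintenance API.
# Assumes boto3 is installed and credentials are configured; IDs are placeholders.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")  # adjust region

redshift.modify_cluster_maintenance(
    ClusterIdentifier="my-cluster",  # placeholder cluster identifier
    DeferMaintenance=True,
    DeferMaintenanceDuration=30,     # days; the console allows up to 45
)
```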
For evidence, you can check the audit log of past maintenance events by looking in the AWS Config service under the "timeline" of your cluster. Follow the View Config Timeline button to open AWS Config for that cluster. In the example screenshot below you can see the exact time (08:49:20) of one maintenance window in the past.
Another way to document that the maintenance window was used is to check the "healthy" dashboard metric on the console or in CloudWatch. If the cluster went unhealthy and then returned to healthy during the maintenance window, it is very likely that AWS performed an update on the systems.
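The same event history the console shows can also be pulled through the Redshift API. A rough sketch, assuming boto3 is available; the cluster identifier and region are placeholders:

```python
# Minimal sketch: list recent Redshift cluster events (reboots, maintenance, etc.)
# to confirm whether an error coincided with the maintenance window.
from datetime import datetime, timedelta

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")  # adjust region

resp = redshift.describe_events(
    SourceIdentifier="my-cluster",  # placeholder cluster identifier
    SourceType="cluster",
    StartTime=datetime.utcnow() - timedelta(days=14),
    EndTime=datetime.utcnow(),
)
for event in resp["Events"]:
    print(event["Date"], event["Message"])
```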

Batch account node restarted unexpectedly

I am using an Azure Batch account to run sqlpackage.exe in order to move databases from one server to another. A task that started 6 days ago was suddenly restarted and began from the beginning after 4 days of running (the databases are extremely large). The task had run uninterrupted until then and should have continued for about another 1-2 days.
The PowerShell script that contains all the logic handles all the exceptions that could occur during execution. Also, the retry count for the task was set to 0 in case it failed.
Unfortunately, I did not have diagnostic settings configured, and I could only look at the metrics; there was a short period when there wasn't any node at all.
What can be the causes of this behavior, i.e., a restart while the node is still running?
Thanks
Unfortunately, there is no way to give a definitive answer to this question. You will need to dig into the compute node (interactively log in) and check the system logs for details on why the node restarted. There is no guarantee that a compute node will have 100% uptime, as there may be hardware faults or other service interruptions.
In general, it's best practice to have long-running tasks checkpoint progress, combined with a retry policy. Programs that can reload state can pick up at the checkpoint when the Batch service automatically reschedules the task execution. Please see the Batch best practices guide for more information.
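As an illustration of the checkpointing idea (a generic sketch, not a Batch-specific API; all names and the work list are hypothetical):

```python
# Minimal checkpoint/resume sketch: persist progress to durable storage so that a
# rescheduled task resumes where it left off instead of starting from scratch.
import json
import os

CHECKPOINT = "checkpoint.json"  # in practice, keep this on storage that survives the node

def load_checkpoint():
    """Return the index of the next unprocessed item (0 if no checkpoint yet)."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["next_index"]
    return 0

def save_checkpoint(next_index):
    """Write the checkpoint atomically so a crash cannot leave it half-written."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, CHECKPOINT)

def process(item):
    ...  # one unit of work, e.g. exporting a single database

items = ["db1", "db2", "db3"]  # hypothetical work list
for i in range(load_checkpoint(), len(items)):
    process(items[i])
    save_checkpoint(i + 1)
```

Combined with a retry policy, a rescheduled run then skips everything already recorded in the checkpoint.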

The target server failed to respond for multiple iterations in JMeter

In my JMeter script, I am getting an error on the 2nd iteration.
For multiple users with a single iteration, no errors were observed, but with multiple iterations I get an error with the message below:
Response code: Non HTTP response code: org.apache.http.NoHttpResponseException
Response message: Non HTTP response message: The target server failed to respond
The response data is "The target server failed to respond".
Could you please suggest what could be the reason behind this error?
Thanks in advance
Most likely your server becomes overloaded. Regarding the possible reason, my expectation is that a single iteration does not deliver the full concurrency, because JMeter acts like this:
JMeter starts all the virtual users within the specified ramp-up period
Each virtual user starts executing samplers
When there are no more samplers to execute and no loops to iterate, the thread is shut down
So with 1 iteration you may run into a situation where some threads have already finished their job while others have not started yet. When you add more iterations, the "old" threads start over and "new" ones keep arriving. The situation is explained in the "JMeter Test Results: Why the Actual Users Number is Lower than Expected" article, and you can monitor the actual delivered load using the Active Threads Over Time chart of the HTML Reporting Dashboard or the Active Threads Over Time listener available via JMeter Plugins.
To get to the bottom of the failure I would recommend checking the following:
component logs on the application under test side (application logs, application/web server logs, database logs);
application under test baseline health metrics (CPU, RAM, disk, etc.). You can use the JMeter PerfMon Plugin; this way you will be able to correlate increasing load with resource consumption (a simple alternative sampler is sketched below).
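If installing the PerfMon server agent is not an option, even a rough sampler on the application host gives you a baseline to correlate against. A minimal sketch, assuming the third-party psutil package is installed:

```python
# Minimal sketch: sample CPU/RAM/disk once per second into a CSV for later
# correlation with the JMeter load profile. Assumes: pip install psutil
import csv
import time

import psutil

with open("health_metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["epoch", "cpu_percent", "ram_percent", "disk_percent"])
    for _ in range(300):  # roughly 5 minutes at 1-second intervals
        writer.writerow([
            int(time.time()),
            psutil.cpu_percent(interval=1),   # blocks ~1s and measures over it
            psutil.virtual_memory().percent,
            psutil.disk_usage("/").percent,
        ])
```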

Node-RED app in Bluemix crashes during performance testing

I have a Node-RED app in Bluemix that contains 2 flows.
The first flow has 3 nodes: an HTTP In node, a function node that reformats one JSON object into another, and an MQ Light node.
The 2nd flow has an MQ Light input node, a batcher node that batches a number of messages together, a couple of nodes to reformat, and then an HTTP request node to put the message into a Cloudant database.
I have been trying to performance test this. I feed it 1000-5000 messages over a few minutes, and it crashes before all the messages are written to the database. The error just says "exit status 255: CRASHED", and I do not see any additional data in the logs.
Any help would be appreciated.
See attached screen prints.
Memory usage: 353MB/1.625GB
Disk usage: 333MB/1GB
CPU: .3%
Update 4/4/2016: crash error added (see screenshot).