Using the Confluent Cloud ksqlDB API endpoint, I send an HTTP request containing a CREATE OR REPLACE STREAM command to update an already existing stream. About half of the time, however, ksqlDB returns a 502 error:
Could not write the statement 'CREATE OR REPLACE STREAM...
Caused by: Another command with the same id
It seems this is a well-known issue that occurs when a ksqlDB command fails to commit. One recommendation has been to simply send the same command again, "...as a subsequent retry of the command should work fine".
Another suggestion is "to clean up the command from the futures map on a commit failure".
Here is a GitHub issue link: https://github.com/confluentinc/ksql/issues/6340
The first recommendation, re-sending the same command over and over until it finally works, does not seem like a good solution. Is the second recommendation, cleaning up the command from the futures map, a better choice?
How do we clean up the command from the futures map? Is there any other known way to make ksqlDB execute the statement reliably?
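To illustrate the first recommendation, here is a minimal retry sketch against the ksqlDB REST /ksql endpoint (Python; the endpoint, API key/secret and the stream definition are placeholders):

import time
import requests

KSQLDB_ENDPOINT = "https://<your-ksqldb-endpoint>"   # placeholder
API_KEY = "<ksqldb-api-key>"                         # placeholder
API_SECRET = "<ksqldb-api-secret>"                   # placeholder

statement = "CREATE OR REPLACE STREAM my_stream AS SELECT * FROM source_stream EMIT CHANGES;"  # placeholder

def run_statement(ksql, retries=5, backoff=2.0):
    """Send a statement to the /ksql endpoint, retrying on 5xx responses."""
    for attempt in range(1, retries + 1):
        resp = requests.post(
            f"{KSQLDB_ENDPOINT}/ksql",
            auth=(API_KEY, API_SECRET),
            headers={"Content-Type": "application/vnd.ksql.v1+json"},
            json={"ksql": ksql, "streamsProperties": {}},
        )
        if resp.status_code < 500:
            return resp                      # success, or a client error worth surfacing
        time.sleep(backoff * attempt)        # simple linear backoff before retrying
    raise RuntimeError(f"Statement still failing after {retries} attempts: {resp.text}")

print(run_statement(statement).json())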
ADF pipeline Data Flow task is stuck in progress. It was working seamlessly for the last couple of months, but suddenly the Data Flow gets stuck in progress and times out after a certain time. We are using an IR with managed virtual network. I am using a ForEach loop to run the data flow for multiple entities in parallel, and it always randomly gets stuck on the last entity.
What can I try to resolve this?
Error in Dev Environment
Error Code 4508
Spark cluster not found
Error in Prod Environment:
Error code: 5000
Failure type: User configuration issue
Details:
[plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
I tried the following steps:
Changing the IR configuration
DF retry and retry interval
Running the ForEach loop one batch at a time instead of 4 batches in parallel
None of the above troubleshooting steps worked. These pipelines have been running for the last 3-4 months without a single failure; they suddenly started failing consistently 3 days ago. The data flow always gets stuck in progress randomly on a different entity and eventually times out, throwing the above errors.
Error Code 4508 Spark cluster not found.
This error can occur for two reasons.
The first is that the debug session gets closed before the data flow finishes its transformation; in this case the recommendation is to restart the debug session.
The second reason is a resource problem or an outage in that particular region.
Error code 5000 Failure type User configuration issue Details [plugins.*** ADF.adf-ir-001 WorkspaceType: CCID:] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
"Livy job state dead caused by unknown error" is typically a transient error. The data flow uses a Spark cluster on the backend, and this error is generated by that cluster. To get more information about the error, check the StdOut of the Spark pool execution.
The backend cluster may be experiencing a network problem, a resource problem, or an outage.
If the error persists, my suggestion is to raise a Microsoft support ticket.
Error Message - job failed with error message The output of the notebook is too large. Cause: rpc response (of 20972488 bytes) exceeds limit of 20971520 bytes
Details:
We are using Databricks notebooks to run the job. The job runs on a job cluster, and it is a streaming job.
The job started failing with the above-mentioned error.
We do not have any display(), show(), print(), or explain() calls in the job.
We are not using the awaitAnyTermination method in the job either.
We also tried adding "spark.databricks.driver.disableScalaOutput true" to the job, but it still did not work; the job keeps failing with the same error.
We have followed all the steps mentioned in this document - https://learn.microsoft.com/en-us/azure/databricks/kb/jobs/job-cluster-limit-nb-output
Is there any option to resolve this issue, or to find out exactly which command's output is causing it to exceed the 20 MB limit?
See the docs regarding Structured Streaming in production.
I would recommend migrating to workflows based on JAR jobs, because:
Notebook workflows are not supported with long-running jobs. Therefore we don’t recommend using notebook workflows in your streaming jobs.
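As a rough sketch of that migration, a JAR task could be defined through the Jobs 2.1 API along these lines; the job name, main class, jar path and cluster settings are placeholders, and the spark.databricks.driver.disableScalaOutput flag from the linked KB article is set in the job cluster's Spark config, which is where that article applies it:

import requests

DATABRICKS_HOST = "https://<workspace-url>"   # placeholder
TOKEN = "<personal-access-token>"             # placeholder

# Define the streaming job as a JAR task instead of a notebook task,
# with the output-suppression flag in the job cluster's Spark config.
job_spec = {
    "name": "streaming-job-jar",                                     # placeholder
    "tasks": [
        {
            "task_key": "stream",
            "spark_jar_task": {
                "main_class_name": "com.example.StreamingMain",      # placeholder
                "parameters": [],
            },
            "libraries": [{"jar": "dbfs:/jars/streaming-job.jar"}],  # placeholder
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",                 # example values
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
                "spark_conf": {
                    # set on the cluster, not inside the notebook/jar code
                    "spark.databricks.driver.disableScalaOutput": "true"
                },
            },
        }
    ],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
print(resp.json())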
I am currently using Airflow 1.8.2 to schedule some EMR tasks and then execute some long-running queries on our Redshift cluster. For that purpose I am using the postgres_operator. The queries take about 30 minutes to run. However, once they are done, the connection never closes and the operator keeps running for another hour and a half until it is terminated at the 2-hour mark every time. The message on termination is that the server closed the connection unexpectedly.
I've checked the logs on Redshift's end and it shows the queries have run and the connection has been closed. Somehow, that is never communicated back to Airflow. Any directions of what more I could check would be helpful. To give some more info, my Airflow installation is an extension of the https://github.com/puckel/docker-airflow docker image, is run in an ECS cluster and has SQLite as backend since I am still testing Airflow out. Also, I'm using the sequential executor for the backend. I would appreciate any help in this matter.
We had a similar issue before. I am using SQLAlchemy to connect to Redshift, but if you are using postgres_operator it should be very similar. It seems Redshift will close the connection if it doesn't see any activity on a long-running query, and in your case 30 minutes is a pretty long query.
Check https://www.postgresql.org/docs/9.5/static/runtime-config-connection.html
You have three settings, tcp_keepalives_idle, tcp_keepalives_interval, and tcp_keepalives_count, that send a keepalive message to Redshift to indicate "Hey, I am still alive."
You can pass the following as arguments, something like this: connect_args={'keepalives': 1, 'keepalives_idle': 60, 'keepalives_interval': 60}
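For context, here is a minimal sketch of that SQLAlchemy variant (the connection URL and the 60-second values are just examples; the keepalives_* keys are plain libpq/psycopg2 parameters, so depending on your Airflow version they may also work from the connection's Extra field for postgres_operator):

from sqlalchemy import create_engine, text

# Example connection URL; replace with your Redshift cluster details.
REDSHIFT_URL = "postgresql+psycopg2://user:password@my-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev"

# TCP keepalives stop idle-looking long queries from being dropped mid-flight.
engine = create_engine(
    REDSHIFT_URL,
    connect_args={
        "keepalives": 1,            # enable TCP keepalives
        "keepalives_idle": 60,      # seconds of inactivity before the first probe
        "keepalives_interval": 60,  # seconds between probes
    },
)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # long-running query would go here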
I came across a problem and thought of a question I did not find a good answer to: how can I purposely make an AWS EMR step fail?
I have a Spark Scala script which is added as a Spark step with some command line arguments and the output of the script is written into S3.
But if something goes wrong while reading and handling the command-line arguments, the main logic of the script is skipped and the script just ends. For EMR this is normal behaviour; it does not know that an if block was not entered.
So after such a "failed" run, the step status still changes to "Completed", and it looks successful even though no results were written to S3.
I want the step to end up in "Failed" status instead.
I can do that by throwing an exception, and then I can see the corresponding exception with my message in the EMR step error logs. But is there a better way? I would like to handle all my exceptions myself, manually.
In addition, can I use the AWS SDK to programmatically find out the reason for a step's failure?
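On the second question: a hedged sketch using boto3 (cluster and step IDs are placeholders). describe_step returns the step state plus a FailureDetails block (reason, message, log file) when a step has failed:

import boto3

emr = boto3.client("emr", region_name="us-east-1")   # region is an example

CLUSTER_ID = "j-XXXXXXXXXXXXX"   # placeholder
STEP_ID = "s-XXXXXXXXXXXXX"      # placeholder

status = emr.describe_step(ClusterId=CLUSTER_ID, StepId=STEP_ID)["Step"]["Status"]
print("State:", status["State"])                 # e.g. COMPLETED, FAILED

# FailureDetails is only present when the step actually failed.
details = status.get("FailureDetails", {})
print("Reason:", details.get("Reason"))
print("Message:", details.get("Message"))
print("Log file:", details.get("LogFile"))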
Return a non-zero value from your program
To intentionally fail the EMR step, you can always add some trivial logic that fails at run time.
For PySpark we put in a line such as a = 5 / 0. This fails the code.
Alternatively, you can point it at an S3 path that doesn't even exist; this will also fail the job at runtime.
OR
You can call exit(1) to return a non-zero value from your code, which fails the EMR step.
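As a sketch of that last approach (the question uses Scala, but the pattern is the same; this is PySpark with made-up paths and argument names), handle the bad input yourself and exit with a non-zero code so EMR marks the step as failed:

import sys
from pyspark.sql import SparkSession

def main(argv):
    if len(argv) < 3:
        # Handle the bad-arguments case ourselves, then exit non-zero so the
        # EMR step is marked FAILED instead of COMPLETED.
        print(f"Usage: {argv[0]} <input_path> <output_path>", file=sys.stderr)
        sys.exit(1)

    input_path, output_path = argv[1], argv[2]
    spark = SparkSession.builder.appName("example-step").getOrCreate()
    try:
        spark.read.parquet(input_path).write.mode("overwrite").parquet(output_path)
    except Exception as exc:
        print(f"Step failed: {exc}", file=sys.stderr)
        sys.exit(1)   # non-zero exit code -> step status becomes FAILED
    finally:
        spark.stop()

if __name__ == "__main__":
    main(sys.argv)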
I am looking into the gcloud logging command line. I started with a classic sample:
gcloud beta logging write --payload-type=struct my-test-log "{\"message\": \"My second entry\", \"weather\": \"aaaaa\"}"
It works fine, so I checked the throughput with the following code. It runs very slowly (about 2 records a second). Is this the best way to do it?
Here is my sample code:
# Tail new log lines and forward each one as a structured log entry
tail -F -q -n0 /root/logs/general/*.log | while read -r line
do
    echo "$line"
    b=$(date)
    gcloud beta logging write --payload-type=struct my-test-log "{\"message\": \"My second entry $b\", \"weather\": \"aaaaa\"}"
done
If you assume each command execution takes around 150ms at best, you can only write a handful of entries every second. You can try using the API directly to send the entries in batches. Unfortunately, the command line can currently only write one entry at a time. We will look into adding the capability to write multiple entries at a time.
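For reference, a sketch of that batched approach, assuming the google-cloud-logging Python client (the log name and payload mirror the example above):

from google.cloud import logging as cloud_logging

client = cloud_logging.Client()          # uses application default credentials
logger = client.logger("my-test-log")

# Buffer several structured entries locally and send them in one API call.
batch = logger.batch()
for i in range(100):
    batch.log_struct({"message": f"My entry {i}", "weather": "aaaaa"})
batch.commit()   # single write request for all buffered entries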
If you want to stream a large number of messages fast, you may want to look into Pub/Sub.
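And, for comparison, a minimal Pub/Sub publishing sketch (project and topic names are placeholders; publish() batches messages under the hood):

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "log-entries")   # placeholders

# publish() is asynchronous; collect the futures and wait for delivery.
futures = []
for i in range(100):
    payload = json.dumps({"message": f"My entry {i}", "weather": "aaaaa"}).encode("utf-8")
    futures.append(publisher.publish(topic_path, payload))

for f in futures:
    f.result()   # block until each message is acknowledged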