Automatically reconnect failed tasks in Kafka-Connect

I'm using a mongo-source plugin with Kafka-connect.
I checked the source task state, and it was running and listening on a mongo collection.
I manually stopped the mongod service, waited about a minute, then started it again.
I checked the source task to see whether it would recover on its own, but after 30 minutes nothing had changed.
Only after restarting the connector did it start working again.
Since mongo-source doesn't have options to set retries and a backoff on timeout, I searched for a configuration that would cover a simple scenario: restart a failed task after X amount of time using Kafka Connect configuration. I couldn't find any.
I can do that with a simple script, but surely there is something in Kafka Connect (or even in mongo-source) that manages failed tasks. I don't want it to fail so quickly after just one minute of downtime.

There isn't any built-in way other than using the REST API to find a failed task and submit a restart request, and then running that on a periodic basis. For example:
curl -s "http://localhost:8083/connectors?expand=status" | \
jq -c -M 'map({name: .status.name } + {tasks: .status.tasks}) | .[] | {task: ((.tasks[]) + {name: .name})} | select(.task.state=="FAILED") | {name: .task.name, task_id: .task.id|tostring} | ("/connectors/"+ .name + "/tasks/" + .task_id + "/restart")' | \
xargs -I{connector_and_task} curl -v -X POST "http://localhost:8083"\{connector_and_task\}
Source: https://rmoff.net/2019/06/06/automatically-restarting-failed-kafka-connect-tasks/
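To run this on a schedule, the snippet can be saved as a script and driven by cron. A minimal sketch, assuming a hypothetical script path and a 5-minute interval (neither is from the original post):
# /etc/cron.d/restart-failed-connect-tasks  (hypothetical path; adjust the user field to your setup)
*/5 * * * * root /usr/local/bin/restart-failed-connect-tasks.sh >> /var/log/connect-restart.log 2>&1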

Related

nats-sub not found after nats-box fresh install

I'm trying to set up a basic NATS service on my Kubernetes cluster, according to their documentation, here. I executed the following command:
$ helm install test-nats nats/nats
NAME: test-nats
LAST DEPLOYED: Thu Jul 14 13:18:09 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
You can find more information about running NATS on Kubernetes
in the NATS documentation website:
https://docs.nats.io/nats-on-kubernetes/nats-kubernetes
NATS Box has been deployed into your cluster, you can
now use the NATS tools within the container as follows:
kubectl exec -n default -it deployment/test-nats-box -- /bin/sh -l
nats-box:~# nats-sub test &
nats-box:~# nats-pub test hi
nats-box:~# nc test-nats 4222
Thanks for using NATS!
$ kubectl exec -n default -it deployment/test-nats-box -- /bin/sh -l
_ _
_ __ __ _| |_ ___ | |__ _____ __
| '_ \ / _` | __/ __|_____| '_ \ / _ \ \/ /
| | | | (_| | |_\__ \_____| |_) | (_) > <
|_| |_|\__,_|\__|___/ |_.__/ \___/_/\_\
nats-box v0.11.0
test-nats-box-84c48d46f-j7jvt:~#
Now, so far, everything has conformed to their start guide. However, when I try to test the connection, I run into trouble:
test-nats-box-84c48d46f-j7jvt:~# nats-sub test &
test-nats-box-84c48d46f-j7jvt:~# /bin/sh: nats-sub: not found
test-nats-box-84c48d46f-j7jvt:~# nats-pub test hi
/bin/sh: nats-pub: not found
It looks like the commands weren't found but they should have been installed when I did the helm install. What's going on here?
I have reproduced the setup on my Kubernetes cluster, successfully deployed the NATS box, and started a client subscriber program: subscribers listen on subjects, and publishers send messages on specific subjects.
1. Create Subscriber
In a shell or command prompt session, start a client subscriber program.
nats sub <subject>
Here, <subject> is the subject to listen on. It helps to use unique, well-thought-out subject strings so that messages reach the correct subscribers even when wildcards are used.
For example:
nats sub msg.test
You should see the message: Listening on [msg.test].
2. Create a Publisher and publish a message
Create a NATS publisher and send a message.
nats pub <subject> <message>
Where <subject> is the subject name and <message> is the text to publish.
For example:
nats pub msg.test nats-message-1
You'll notice that the publisher sends the message and prints: Published [msg.test] : 'nats-message-1'.
The subscriber receives the message and prints: [#1] Received on [msg.test]: 'nats-message-1'.
Here, you have used nats-sub and nats-pub, which are deprecated. Use the commands above instead.
I had the same problem. nats-sub and nats-pub seem to be deprecated and you need to use nats sub and nats pub instead.
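Inside the nats-box pod, the equivalent of the original session would look like this (a quick sketch; the deployment name test-nats-box comes from the question):
$ kubectl exec -n default -it deployment/test-nats-box -- /bin/sh -l
test-nats-box:~# nats sub test &
test-nats-box:~# nats pub test hi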

How to wait until a job is done or a file is updated in airflow

I am trying to use apache-airflow, with google cloud-composer, to schedule batch processing that results in training a model on Google AI Platform. I failed to use Airflow operators, as I explain in this question: unable to specify master_type in MLEngineTrainingOperator.
Using the command line I managed to launch a job successfully.
So now my issue is to integrate this command in airflow.
Using BashOperator I can train the model, but I need to wait for the job to complete before creating a version and setting it as the default. This DAG creates a version before the job is done:
bash_command_train = "gcloud ai-platform jobs submit training training_job_name " \
                     "--packages=gs://path/to/the/package.tar.gz " \
                     "--python-version=3.5 --region=europe-west1 --runtime-version=1.14" \
                     " --module-name=trainer.train --scale-tier=CUSTOM --master-machine-type=n1-highmem-16"

bash_train_operator = BashOperator(task_id='train_with_bash_command',
                                   bash_command=bash_command_train,
                                   dag=dag,)
...
create_version_op = MLEngineVersionOperator(
    task_id='create_version',
    project_id=PROJECT,
    model_name=MODEL_NAME,
    version={
        'name': version_name,
        'deploymentUri': export_uri,
        'runtimeVersion': RUNTIME_VERSION,
        'pythonVersion': '3.5',
        'framework': 'SCIKIT_LEARN',
    },
    operation='create')

set_version_default_op = MLEngineVersionOperator(
    task_id='set_version_as_default',
    project_id=PROJECT,
    model_name=MODEL_NAME,
    version={'name': version_name},
    operation='set_default')

# Ordering the tasks
bash_train_operator >> create_version_op >> set_version_default_op
The training results in a file being updated in Google Cloud Storage, so I am looking for an operator or a sensor that will wait until this file is updated. I noticed GoogleCloudStorageObjectUpdatedSensor, but I don't know how to make it retry until this file is updated.
Another solution would be to check whether the job has completed, but I can't find how to do that either.
Any help would be greatly appreciated.
The Google Cloud documentation for the --stream-logs flag:
"Block until job completion and stream the logs while the job runs."
Add this flag to bash_command_train and I think it should solve your problem. The command will only return once the job is finished, so Airflow will mark the task as a success only then. It also lets you monitor your training job's logs in Airflow.
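Concretely, the submitted command would then look something like this (a sketch of the command from the question with only --stream-logs added):
gcloud ai-platform jobs submit training training_job_name \
    --stream-logs \
    --packages=gs://path/to/the/package.tar.gz \
    --python-version=3.5 --region=europe-west1 --runtime-version=1.14 \
    --module-name=trainer.train --scale-tier=CUSTOM --master-machine-type=n1-highmem-16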

How to run e2e tests on a custom cluster within Kubernetes

https://github.com/kubernetes/community/blob/master/contributors/devel/e2e-tests.md#testing-against-local-clusters
I have been following the above guide, but I keep getting this error:
2017/07/12 09:53:58 util.go:131: Step './cluster/kubectl.sh version --match-server-version=false' finished in 20.604745ms
2017/07/12 09:53:58 util.go:129: Running: ./hack/e2e-internal/e2e-status.sh
WARNING: The bash deployment for AWS is obsolete. The
v1.5.x releases are the last to support cluster/kube-up.sh with AWS.
For a list of viable alternatives, (...)
2017/07/12 09:53:58 util.go:131: Step './hack/e2e-internal/e2e-status.sh' finished in 18.71843ms
2017/07/12 09:53:58 main.go:216: Something went wrong: encountered 2 errors: [error during ./cluster/kubectl.sh version --match-server-version=false: exit status 1 error during ./hack/e2e-internal/e2e-status.sh: exit status 1]
2017/07/12 09:53:58 e2e.go:78: err: exit status 1
How do I fix this, what am I doing wrong?
If you just want to execute e2e tests without setting up the whole cluster, you can compile them from the kubernetes repository:
make all WHAT=test/e2e/e2e.test
and then run the compiled e2e binary against your cluster:
./e2e.test --host="<your apiserver>" --provider=local --kubeconfig=<kubeconfig location> --ginkgo.focus="\[Conformance\]"
Conformance tests should pass for any Kubernetes cluster, but of course you can set any filter you want. To list all available tests, type:
./e2e.test --ginkgo.DryRun
Some supplements
You can also compile ginkgo:
make WHAT=vendor/github.com/onsi/ginkgo/ginkgo
Some options are useful (run ginkgo --help for details):
-flakeAttempts
-focus
-nodes
-outputdir
-skip
-v
To run tests in parallel (set --nodes=1 for serial tests):
./_output/bin/ginkgo --nodes=25 --flakeAttempts=2 \
./_output/bin/e2e.test -- --host="http://127.0.0.1:8080" \
--provider="local" --ginkgo.v=true --kubeconfig="~/.kube/config" \
--ginkgo.focus="Conformance" --ginkgo.skip="Serial|Slow" \
--ginkgo.failFast=false
And if you want to launch local cluster for e2e testing, hack/local-up-cluster.sh is handy.
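A rough end-to-end flow on a development machine might then look like this (a sketch; the apiserver address and kubeconfig path depend on how local-up-cluster.sh is configured in your environment):
# terminal 1: bring up a local cluster from the kubernetes source tree
hack/local-up-cluster.sh
# terminal 2: run only Conformance tests against it
./_output/bin/e2e.test --host="http://127.0.0.1:8080" --provider=local \
    --kubeconfig="/var/run/kubernetes/admin.kubeconfig" --ginkgo.focus="Conformance"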

Custom Munin plugin won't report

I've built my first Munin plugin to give us the size of our Redis queue, but it won't report for some reason. Every other plugin on the node, including other Redis-centric plugins work fine.
Here's the plugin code:
#!/bin/sh
# Munin plugin: report the length of a Redis list (the queue) as a gauge.
case $1 in
config)
# "config" invocation: describe the graph and its single data series.
cat <<'EOM'
multigraph redis_queue_size
graph_title Redis Queue Size
graph_info The size of Redis queue
graph_category redis
graph_vlabel Messages
redisqueue.label redisqueue
redisqueue.type GAUGE
redisqueue.min 0
EOM
exit 0;;
esac
# Normal invocation: print the current queue length.
queuelength=`redis-cli llen mykeyname`
printf "redisqueue.value "
echo $queuelength
The plugin is in /usr/share/munin/plugins/redis_queue_
The plugin is symlinked to /etc/munin/plugins/redis_queue_
I made sure to restart the service
$ sudo service munin-node force-reload
If I run sudo munin-run redis_queue_ I get the correct output:
redisqueue.value 1567595
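Fetching the config section the same way returns exactly what the config case of the script above prints (shown here for completeness):
$ sudo munin-run redis_queue_ config
multigraph redis_queue_size
graph_title Redis Queue Size
...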
If I run munin-node-config I get the following:
redis_queue_ | yes |
If I connect to the instance from the master using telnet to fetch the plugin, I get:
$ telnet 10.101.21.56 4949
Trying 10.101.21.56...
Connected to 10.101.21.56.
Escape character is '^]'.
# munin node at redis01.example.com
fetch redis_queue_
redisqueue.value 1035336
The master shows an empty graph for it, but the "last updated" time isn't increasing. I initially had the plugin configured a little differently (it wasn't producing good output) so all the values are -nan. Once I fixed the output, I expected the plugin to start working, but all efforts have failed.
Everything looks right, yet there are still no values in the graph.
Edit: Munin v1.4.6

Hector test example not working on Cassandra 0.7.4

I have set up my single-node Cassandra 0.7.4 and started the service with bin/cassandra -f. Now I am trying to use the Hector API (v. 0.7.0) to manage the DB.
The Cassandra CLI works fine and I can create keyspaces and so on.
I tried to run the test example and create a single keyspace:
Cluster cluster = HFactory.getOrCreateCluster("TestCluster",
new CassandraHostConfigurator("localhost:9160"));
Keyspace keyspace = HFactory.createKeyspace("Keyspace1", cluster);
But all I get is this:
2011-04-14 22:20:27,469 [main] INFO  me.prettyprint.cassandra.connection.CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
2011-04-14 22:20:27,492 [main] DEBUG me.prettyprint.cassandra.connection.HThriftClient - Transport open status false for client CassandraClient<localhost:9160-1>
....this again about 20 times
me.prettyprint.cassandra.service.JmxMonitor - Registering JMX me.prettyprint.cassandra.service_TestCluster:ServiceType=hector,MonitorType=hector
2011-04-14 22:20:27,636 [Thread-0] INFO  me.prettyprint.cassandra.connection.CassandraHostRetryService - Downed Host retry shutdown hook called
2011-04-14 22:20:27,646 [Thread-0] INFO  me.prettyprint.cassandra.connection.CassandraHostRetryService - Downed Host retry shutdown complete
Can you please tell me what I'm doing wrong?
Thanks
When you connect via the CLI, do you specify "-h localhost -p 9160"?
Can you actually do stuff on the command line with the above?
The error from HThriftClient indicates it could not connect to the Cassandra Daemon.
FTR, you would get responses much faster via hector-users@googlegroups.com
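For reference, connecting the CLI explicitly to the same host and Thrift port that Hector is pointed at looks like this (localhost and 9160 are the defaults assumed throughout this thread):
bin/cassandra-cli -h localhost -p 9160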
If you are on a Linux machine, try starting up your Cassandra server with this command:
/bin$ ./cassandra -f
Then for the CLI, use this command:
./cassandra-cli -h {hostname} -p 9160
Then make sure that the configuration is OK.
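If in doubt about the port, you can check which Thrift port the daemon is configured to expose (rpc_port in conf/cassandra.yaml, 9160 by default in 0.7.x):
grep rpc_port conf/cassandra.yaml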