Concourse Worker on another server loses connection to Concourse Web - concourse

We have a Concourse Web container and a Concourse Worker container running on Server A (212.77.7.255 - the real IP is concealed). We use the latest Concourse version, 7.8.1.
As we ran out of Worker resources, we added another Concourse Worker container on Server B. The Worker on Server B had been running fine for about five days, but suddenly it is no longer able to connect to Concourse Web on Server A.
The logs of the Worker on Server B say:
{
  "timestamp": "2022-07-12T11:15:59.542985762Z",
  "level": "error",
  "source": "worker",
  "message": "worker.container-sweeper.tick.failed-to-connect-to-tsa",
  "data": {
    "error": "dial tcp 212.77.7.255:2222: i/o timeout",
    "session": "6.4"
  }
}
{
  "timestamp": "2022-07-12T11:15:59.543044656Z",
  "level": "error",
  "source": "worker",
  "message": "worker.container-sweeper.tick.dial.failed-to-connect-to-any-tsa",
  "data": {
    "error": "all worker SSH gateways unreachable",
    "session": "6.4.2"
  }
}
{
  "timestamp": "2022-07-12T11:15:59.543060804Z",
  "level": "error",
  "source": "worker",
  "message": "worker.container-sweeper.tick.failed-to-dial",
  "data": {
    "error": "all worker SSH gateways unreachable",
    "session": "6.4"
  }
}
{
  "timestamp": "2022-07-12T11:15:59.543068953Z",
  "level": "error",
  "source": "worker",
  "message": "worker.container-sweeper.tick.failed-to-get-containers-to-destroy",
  "data": {
    "error": "all worker SSH gateways unreachable",
    "session": "6.4"
  }
}
{
  "timestamp": "2022-07-12T11:15:59.554118751Z",
  "level": "error",
  "source": "worker",
  "message": "worker.volume-sweeper.tick.failed-to-connect-to-tsa",
  "data": {
    "error": "dial tcp 212.77.7.255:2222: i/o timeout",
    "session": "7.4"
  }
}
{
  "timestamp": "2022-07-12T11:15:59.554164844Z",
  "level": "error",
  "source": "worker",
  "message": "worker.volume-sweeper.tick.dial.failed-to-connect-to-any-tsa",
  "data": {
    "error": "all worker SSH gateways unreachable",
    "session": "7.4.3"
  }
}
{
  "timestamp": "2022-07-12T11:15:59.554172593Z",
  "level": "error",
  "source": "worker",
  "message": "worker.volume-sweeper.tick.failed-to-dial",
  "data": {
    "error": "all worker SSH gateways unreachable",
    "session": "7.4"
  }
}
{
  "timestamp": "2022-07-12T11:15:59.554179789Z",
  "level": "error",
  "source": "worker",
  "message": "worker.volume-sweeper.tick.failed-to-get-volumes-to-destroy",
  "data": {
    "error": "all worker SSH gateways unreachable",
    "session": "7.4"
  }
}
{
  "timestamp": "2022-07-12T11:16:04.580220012Z",
  "level": "error",
  "source": "worker",
  "message": "worker.beacon-runner.beacon.failed-to-connect-to-tsa",
  "data": {
    "error": "dial tcp 212.77.7.255:2222: i/o timeout",
    "session": "4.1"
  }
}
{
  "timestamp": "2022-07-12T11:16:04.580284659Z",
  "level": "error",
  "source": "worker",
  "message": "worker.beacon-runner.beacon.dial.failed-to-connect-to-any-tsa",
  "data": {
    "error": "all worker SSH gateways unreachable",
    "session": "4.1.10"
  }
}
{
  "timestamp": "2022-07-12T11:16:04.580335377Z",
  "level": "error",
  "source": "worker",
  "message": "worker.beacon-runner.beacon.failed-to-dial",
  "data": {
    "error": "all worker SSH gateways unreachable",
    "session": "4.1"
  }
}
{
  "timestamp": "2022-07-12T11:16:04.580359868Z",
  "level": "error",
  "source": "worker",
  "message": "worker.beacon-runner.beacon.exited-with-error",
  "data": {
    "error": "all worker SSH gateways unreachable",
    "session": "4.1"
  }
}
{
  "timestamp": "2022-07-12T11:16:04.580372552Z",
  "level": "debug",
  "source": "worker",
  "message": "worker.beacon-runner.beacon.done",
  "data": {
    "session": "4.1"
  }
}
{
  "timestamp": "2022-07-12T11:16:04.580394879Z",
  "level": "error",
  "source": "worker",
  "message": "worker.beacon-runner.failed",
  "data": {
    "error": "all worker SSH gateways unreachable",
    "session": "4"
  }
}
The logs on Concourse Web on Server A show no entries of the Worker on Server B trying to connect. On Server B I'm able to connect to Concourse Web on Server A:
$ nc 212.77.7.255 2222
SSH-2.0-Go
We had this problem before, but solved it by upgrading Concourse to the latest version, 7.8.1. Now I'm running out of options for debugging this. What I've tried:
restarting the workers
restarting the web container
pruning the stalled worker of Server B
docker system prune on Server B
Nothing helps. What can I do to debug this further and make the Worker on Server B connect again?
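For example, would it help to compare connectivity from the host with connectivity from inside the worker container? A rough sketch of what I have in mind (the container name is a placeholder, and this assumes nc is available in the worker image):
$ nc -vz 212.77.7.255 2222
$ docker exec <concourse-worker-container> nc -vz 212.77.7.255 2222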

You said it happened with an earlier version, you "ran out of Worker resources", and I'm seeing I/O timeouts in the logs... the one component you didn't mention is the DB.
It might be that the maximum number of connections on the DB has been reached, especially if the DB is used for purposes other than just Concourse. That's where I'd look next.
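A rough way to check, assuming the Concourse database is PostgreSQL (host, user and database names below are placeholders for your deployment):
$ psql -h <db-host> -U <db-user> -d <concourse-db> -c "SHOW max_connections;"
$ psql -h <db-host> -U <db-user> -d <concourse-db> -c "SELECT count(*) FROM pg_stat_activity;"
If the second number is at or close to the first, connection exhaustion is the likely culprit.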

We couldn't find out why the Docker network did not allow connections to Server A. As connections from the host machine were going through, we told Docker to use the host network:
services:
  concourse-worker:
    ...
    network_mode: host
    ...
This solved the issue. It's not a pretty workaround, as the Docker container should have its own separate network, but since nothing else is running on this server it's fine.
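After switching to the host network, a quick way to confirm the worker registered again is fly workers (the target name is a placeholder); the Server B worker should be listed as running rather than stalled:
$ fly -t <target> workers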

Related

Azure data factory ci/cd release errors

I am trying to deploy to the UAT environment and followed the steps shown in a blog post and a YouTube video.
However, I keep getting failures.
If I run it in 'validation only' mode, it passes fine, but when I actually deploy it under 'incremental' I receive the following errors:
2022-03-01T17:24:27.6268601Z ##[error]At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.
2022-03-01T17:24:27.6301466Z ##[error]Details:
2022-03-01T17:24:27.6306699Z ##[error]BadRequest: Failed to encrypt sub-resource payload {
"Id": "/subscriptions/a834838a-11d5-4657-a9c3-bc8b2ebdaa59/resourceGroups/adf-rg-uat-uks/providers/Microsoft.DataFactory/factories/adf-df-uat-uks/linkedservices/adfSQLDB_Dev",
"Name": "adfSQLDB_Dev",
"Properties": {
"annotations": [],
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "********************"
}
}
} and error is: Message for the errorCode not found..
2022-03-01T17:24:27.6314634Z ##[error]BadRequest: Failed to encrypt sub-resource payload {
"Id": "/subscriptions/a834838a-11d5-4657-a9c3-bc8b2ebdaa59/resourceGroups/adf-rg-uat-uks/providers/Microsoft.DataFactory/factories/adf-df-uat-uks/linkedservices/adfSQLDB_Prod",
"Name": "adfSQLDB_Prod",
"Properties": {
"annotations": [],
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "********************"
}
}
} and error is: Message for the errorCode not found..
2022-03-01T17:24:27.6322501Z ##[error]BadRequest: Failed to encrypt sub-resource payload {
"Id": "/subscriptions/a834838a-11d5-4657-a9c3-bc8b2ebdaa59/resourceGroups/adf-rg-uat-uks/providers/Microsoft.DataFactory/factories/adf-df-uat-uks/linkedservices/adfBlobStorage",
"Name": "adfBlobStorage",
"Properties": {
"annotations": [],
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": "********************"
}
}
} and error is: Expecting connection string of format "key1=value1; key2=value2"..
2022-03-01T17:24:27.6328134Z ##[error]BadRequest: Failed to encrypt sub-resource payload {
"Id": "/subscriptions/a834838a-11d5-4657-a9c3-bc8b2ebdaa59/resourceGroups/adf-rg-uat-uks/providers/Microsoft.DataFactory/factories/adf-df-uat-uks/linkedservices/adfSQLDB",
"Name": "adfSQLDB",
"Properties": {
"annotations": [],
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "********************"
}
}
} and error is: Message for the errorCode not found..
2022-03-01T17:24:27.6336169Z ##[error]BadRequest: Failed to encrypt sub-resource payload {
"Id": "/subscriptions/a834838a-11d5-4657-a9c3-bc8b2ebdaa59/resourceGroups/adf-rg-uat-uks/providers/Microsoft.DataFactory/factories/adf-df-uat-uks/linkedservices/adfSQLDB_UAT",
"Name": "adfSQLDB_UAT",
"Properties": {
"annotations": [],
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "********************"
}
}
} and error is: Message for the errorCode not found..
2022-03-01T17:24:27.6339122Z ##[error]Check out the troubleshooting guide to see if your issue is addressed: https://learn.microsoft.com/en-us/azure/devops/pipelines/tasks/deploy/azure-resource-group-deployment?view=azure-devops#troubleshooting
2022-03-01T17:24:27.6342204Z ##[error]Task failed while creating or updating the template deployment.
Regards
Mark
Are you using a self-hosted integration runtime? If yes, check that it is in the available state during the deployment; otherwise, your connection string may be missing the SecureString property:
"typeProperties": {
  "connectionString": {
    "type": "SecureString",
    "value": .............
  }
}

PostgreSQL internal server 500

I am getting a 500 internal server error with this message:
{
  "message": "no pg_hba.conf entry for host "98.222.142.89", user "ykqkrdvlmsxktt", database "dp1g7eegdjct9ssl=true&sslfactory=org.postgresql.ssl.NonValidati", SSL off",
  "error": {
    "length": 216,
    "name": "error",
    "severity": "FATAL",
    "code": "28000",
    "file": "auth.c",
    "line": "520",
    "routine": "ClientAuthentication"
  }
}

What is my Custom Resource Definition URL in Kubernetes

I am trying to hit my custom resource definition endpoint in Kubernetes but cannot find an exact example of how Kubernetes exposes my custom resource definition in the Kubernetes API. If I hit the CustomResourceDefinitions API with this:
https://localhost:6443/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions
I get back this response:
{
"items": [
{
"metadata": {
"name": "accounts.stable.ibm.com",
"selfLink": "/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/accounts.stable.ibm.com",
"uid": "eda9d695-d3d4-11e9-900f-025000000001",
"resourceVersion": "167252",
"generation": 1,
"creationTimestamp": "2019-09-10T14:11:48Z",
"deletionTimestamp": "2019-09-12T22:26:20Z",
"finalizers": [
"customresourcecleanup.apiextensions.k8s.io"
]
},
"spec": {
"group": "stable.ibm.com",
"version": "v1",
"names": {
"plural": "accounts",
"singular": "account",
"shortNames": [
"acc"
],
"kind": "Account",
"listKind": "AccountList"
},
"scope": "Namespaced",
"versions": [
{
"name": "v1",
"served": true,
"storage": true
}
],
"conversion": {
"strategy": "None"
}
},
"status": {
"conditions": [
{
"type": "NamesAccepted",
"status": "True",
"lastTransitionTime": "2019-09-10T14:11:48Z",
"reason": "NoConflicts",
"message": "no conflicts found"
},
{
"type": "Established",
"status": "True",
"lastTransitionTime": null,
"reason": "InitialNamesAccepted",
"message": "the initial names have been accepted"
},
{
"type": "Terminating",
"status": "True",
"lastTransitionTime": "2019-09-12T22:26:20Z",
"reason": "InstanceDeletionCheck",
"message": "could not confirm zero CustomResources remaining: timed out waiting for the condition"
}
],
"acceptedNames": {
"plural": "accounts",
"singular": "account",
"shortNames": [
"acc"
],
"kind": "Account",
"listKind": "AccountList"
},
"storedVersions": [
"v1"
]
}
}
]
}
This leads me to believe I have correctly created the custom resource accounts. There are a number of examples that don't seem to be quite right, and I cannot find my resource in the Kubernetes REST API. I can work with my custom resource from kubectl, but I need to expose it with RESTful APIs.
https://localhost:6443/apis/stable.example.com/v1/namespaces/default/accounts
returns
404 page not found
Where as:
https://localhost:6443/apis/apiextensions.k8s.io/v1beta1/apis/stable.ibm.com/namespaces/default/accounts
returns
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "the server could not find the requested resource",
"reason": "NotFound",
"details": {},
"code": 404
}
I have looked at https://docs.okd.io/latest/admin_guide/custom_resource_definitions.html and https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/
The exact URL would be appreciated.
A quite decent way of retrieving the REST API path for a K8s resource is to execute the kubectl get command at a high verbosity level, as @Suresh Vishnoi mentioned in the comment:
kubectl get <api-resource> -v=8
Apparently, as eventually checked by @Amit Kumar Gupta, the correct URL for accessing the custom resource as per your CRD JSON output is the following:
https://<API_server>:port/apis/stable.ibm.com/v1/namespaces/default/accounts
Depending on your setup, you may choose X509 client certs, a static token file, bearer tokens, or the HTTP API proxy in order to authenticate user requests against the Kubernetes API.
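For a quick test, the HTTP API proxy route is the simplest; a minimal sketch (assuming kubectl is already configured against this cluster and port 8001 is free) would be:
$ kubectl proxy --port=8001 &
$ curl http://127.0.0.1:8001/apis/stable.ibm.com/v1/namespaces/default/accounts
The proxy handles authentication to the API server, so curl needs no extra flags or certificates.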

"Error getting chaincode package bytes" when deploying chaincode on hyperledger via REST

I'm trying to deploy chaincode on Hyperledger (Bluemix service) via a REST POST to
/chaincode
QuerySpec
{ "jsonrpc": "2.0", "method": "deploy", "params": { "type": 1,
"chaincodeID": { "path":
"https://github.com/romeokienzler/learn-chaincode/tree/master/finished"
}, "ctorMsg": { "function": "init", "args": [ "hi there" ] },
"secureContext": "user_type1_0" }, "id": 1 }
I've also tried these links:
https://github.com/romeokienzler/learn-chaincode/blob/master/finished/chaincode_finished?raw=true
https://raw.githubusercontent.com/romeokienzler/learn-chaincode/master/finished/chaincode_finished.go
I always get
{ "jsonrpc": "2.0", "error": {
"code": -32001,
"message": "Deployment failure",
"data": "Error when deploying chaincode: Error getting chaincode package bytes: Error getting code 'go get' failed with error: 'exit
status 1'\npackage
github.com/romeokienzler/learn-chaincode/tree/master/finished: cannot
find package
'github.com/romeokienzler/learn-chaincode/tree/master/finished' in any
of:\n\t/usr/local/go/src/github.com/romeokienzler/learn-chaincode/tree/master/finished
(from
$GOROOT)\n\t/go/usercode/552962906/src/github.com/romeokienzler/learn-chaincode/tree/master/finished
(from
$GOPATH)\n\t/go/src/github.com/romeokienzler/learn-chaincode/tree/master/finished\n"
}, "id": 1 }
Any idea?
Considering that you are playing with the Bluemix service, I assume you are following the "Implementing your first chaincode" tutorial.
In your forked repository you will see instructions to use branch v1.0 for the Bluemix Blockchain Service (link); the IBM Bluemix service is (still) using Fabric v0.5.
Once you have registered with one of the available enroll IDs, you should be able to deploy your chaincode using the following DeploySpec (note the path "https://github.com/romeokienzler/learn-chaincode/tree/v1.0/finished"):
{
  "jsonrpc": "2.0",
  "method": "deploy",
  "params": {
    "type": 1,
    "chaincodeID": {
      "path": "https://github.com/romeokienzler/learn-chaincode/tree/v1.0/finished"
    },
    "ctorMsg": {
      "function": "init",
      "args": [
        "hi there"
      ]
    },
    "secureContext": "user_type1_0"
  },
  "id": 1
}
First of all, the deploy command should be changed (the value of the path variable is different):
{
  "jsonrpc": "2.0",
  "method": "deploy",
  "params": {
    "type": 1,
    "chaincodeID": {
      "path": "https://github.com/romeokienzler/learn-chaincode/finished"
    },
    "ctorMsg": {
      "function": "init",
      "args": ["hi there"]
    },
    "secureContext": "user_type1_0"
  },
  "id": 1
}
P.S. As @Mil4n correctly mentioned, IBM Bluemix still works with Fabric v0.5. The chaincode romeokienzler/learn-chaincode/finished should be adapted to this version.
For example, shim.ChaincodeStubInterface is not available yet and should be replaced with *shim.ChaincodeStub.

presto: Discovery server cannot get connect

Recently I built Presto in cluster mode, with 1 coordinator and 1 worker, and it works.
Then I repackaged "presto-main-0.148.jar" without any change and deployed it to the production environment, and it doesn't work! I always get the response "No worker nodes available".
I searched the server.log and saw the messages below:
ERROR Discovery-0 io.airlift.discovery.client.CachingServiceSelector Cannot
connect to discovery server for refresh (collector/general): Lookup
of collector failed for
ht*p://10.3.2.33:18080/v1/service/collector/general
ERROR Discovery-0 io.airlift.discovery.client.CachingServiceSelector Cannot
connect to discovery server for refresh (presto/general): Lookup of
presto failed for ht*p://10.3.2.33:18080/v1/service/presto/general
INFO Discovery-1 io.airlift.discovery.client.CachingServiceSelector Discovery
server connect succeeded for refresh (collector/general)
INFO Discovery-2 io.airlift.discovery.client.CachingServiceSelector Discovery
server connect succeeded for refresh (presto/general)
So I guess the discovery server is not started. But when I use the command curl "h*tp://10.3.2.33:18080/v1/service/collector/general",
I get the response below, and I also see the coordinator status as 'ACTIVE':
{
"environment": "presto_**_flt",
"services": [
{
"id": "954e886d-7506-4f00-b954-eeab49209835",
"nodeId": "4c0f2596-7e6e-11e6-ae22-56b6b6499611",
"type": "presto",
"pool": "general",
"location": "/4c0f2596-7e6e-11e6-ae22-56b6b6499611",
"properties": {
"node_version": "a0e36ae",
"coordinator": "false",
"http": "h*tp://10.3.2.24:18080",
"http-external": "h*tp://10.3.2.24:18080",
"datasources": "hive,system"
}
},
{
"id": "6790b522-cd17-48ef-b077-e4e8fa97e310",
"nodeId": "4c0f2366-7e6e-11e6-ae22-56b6b6499611",
"type": "presto",
"pool": "general",
"location": "/4c0f2366-7e6e-11e6-ae22-56b6b6499611",
"properties": {
"node_version": "c34bef3-dirty",
"coordinator": "true",
"http": "h*tp://10.3.2.33:18080",
"http-external": "h*tp://10.3.2.33:18080",
"datasources": ""
}
}
]
}
I think this is because you have two different node_version values in these two services.
If you are repackaging presto-main or any other component, make sure you are using the same binaries on all the nodes.
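As a quick sanity check (a sketch; the CLI name and coordinator address are assumptions based on your logs), you can compare the version each node reports via the system.runtime.nodes table:
$ presto-cli --server 10.3.2.33:18080 --execute "SELECT node_id, node_version, coordinator, state FROM system.runtime.nodes"
If the node_version values differ, redeploy the same build to both nodes.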