How to sync the user directory on Bitbucket Server to Jira with both running on AKS? - kubernetes

When trying to sync the user directory of Jira to other Atlassian products (Confluence and Bitbucket Server, running on AKS), a 403 error is returned.
After looking into this error, the steps from the following article have been attempted:
https://confluence.atlassian.com/stashkb/unable-to-connect-to-jira-for-authentication-forbidden-403-323391874.html
The IP addresses have been added to the whitelist in Jira. The next step in the solutions found online is to restart the Jira service.
This, however, causes issues: after running the stop-jira.sh/start-jira.sh scripts inside the pod, the service comes back with none of the previous settings, and all configuration, including backups, is gone, taking us back to square one.
Current set-up
Cluster size: 3 x Standard D8 v3 (8 vCPUs, 32 GiB memory) on AKS
The following images were used, installed through the UI:
atlassian/jira-software
cptactionhank/docker-atlassian-jira
Exec into the pod and go to /opt/atlassian/jira/bin
Run ./stop-jira.sh, then ./start-jira.sh
What happens is that, on going back to the URL, the Jira instance has been reset and all configuration files in the pod for the service are lost.
The logs of the pod commonly show error number 137 when restarting.
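A note on the 137: if that number is the container's exit status, then by the usual shell and container-runtime convention a status above 128 means the process was terminated by signal number (status - 128), which would make 137 signal 9, SIGKILL. That reading is an assumption about where the number comes from; the logs above do not confirm it. A minimal sketch for decoding such a status (the helper name describe_exit_status is made up for illustration):

```python
import signal


def describe_exit_status(status: int) -> str:
    """Translate a process exit status into a readable description.

    Assumes the common convention that a status above 128 encodes
    "terminated by signal (status - 128)".
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    if status == 0:
        return "exited normally"
    return f"exited with error code {status}"


print(describe_exit_status(137))
```

On Linux this reports signal 9 (SIGKILL). A SIGKILL on a container is often an out-of-memory kill or a forced stop after the termination grace period; which one applies here would need to be checked, for example with kubectl describe pod.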
Update: the following Helm chart has also been used, with the same result:
https://github.com/int128/devops-kompose/tree/master/atlassian-jira-software

Related

Container App Environment creation timing out

My workplace has just started migrating to the cloud. We've successfully deployed a number of resources into Azure using Terraform and Pipelines.
Where we are running into issues is deploying a Container App Environment. We have code that was working in a less locked-down environment (set up for a proof of concept), but we are now having issues using that code in our go-forward environment.
When deploying, the Container App Environment spends 30 minutes attempting to create before it returns a "context deadline exceeded" error. Looking in the Azure Portal, I can see the resource in the "Waiting" provisioning state, and I can also see the MC_ and AKS resources that get generated. It then fails around 4 hours later.
Any advice?
I suspect it's related to security on the Virtual Network that the subnets sit on, but I'm not seeing any logs on the deployment to confirm this. The original subnets had a Network Security Group (NSG) assigned, and I configured the rules that Microsoft provides. I then added a couple of subnets without an NSG assigned, but still had no luck.
My next step is to try provisioning it via the GUI and see if that works.
I managed to break our build in the "anything goes" environment.
The root cause is an incomplete configuration of the Virtual Network which has custom DNS entries. This has now been passed to our network architects to resolve. If I can get more details on the fix they apply I'll include that here for anyone else that runs into the issue.

Connection from VS Code to Kubernetes failing

I am receiving an error message when trying to access details from VS Code of my Azure Kubernetes Cluster. This problem prevents me from attaching a debugger to the pod.
I receive the following error message:
Error loading document: Error: cannot open k8smsx://loadkubernetescore/pod-kube-system%20%20%20coredns-748cdb7bf4-q9f9x.yaml?ns%3Dall%26value%3Dpod%2Fkube-system%20%20%20coredns-748cdb7bf4-q9f9x%26_%3D1611398456559. Detail: Unable to read file 'k8smsx://loadkubernetescore/pod-kube-system coredns-748cdb7bf4-q9f9x.yaml?ns=all&value=pod/kube-system coredns-748cdb7bf4-q9f9x&_=1611398456559' (Get command failed: error: there is no need to specify a resource type as a separate argument when passing arguments in resource/name form (e.g. 'kubectl get resource/<resource_name>' instead of 'kubectl get resource resource/<resource_name>'
)
My Setup
I have VS Code installed, with the "Kubernetes", "Bridge to Kubernetes" and "Azure Kubernetes Service" extensions
I have connected my Cluster through az login and can already access different information (e.g. my nodes, etc.)
When trying to access the workloads / pods on my cluster, I receive the above error message - and in the Kubernetes View in VS Code I get an error for the details of the pod.
Error in Kubernetes-View in VS Code
What I tried
I tried recreating the AKS cluster and logging in to it completely afresh
I tried to reinstall all extensions mentioned above in VS Code
Browsing the internet, I do not find any comparable error message
The strange thing is that it used to work two weeks ago - and I did not change or update anything (as far as I remember)
Any ideas / hints that I can try further?
Thank you
As @mdaniel wrote: the Node view is just for human consumption, and the tree item you actually want to click on is underneath Namespaces / kube-system / coredns-748cdb7bf4-q9f9x. Give that a try, and consider reporting your bad experience to their issue tracker, since it looks like release 1.2.2 came out just 2 days ago and might not have been tested well.
The final solution was to attach the debugger another way: through Workloads / Deployments.

Why would running a container on GCE get stuck on "Metadata request unsuccessful: Forbidden (403)"?

I'm trying to run a container in a custom VM on Google Compute Engine. This is to perform a heavy ETL process, so I need a large machine but only for a couple of hours a month. I have two versions of my container with small startup changes. Both versions were built and pushed to the same Google Container Registry by the same computer using the same Google login. The older one works fine, but the newer one fails by getting stuck in an endless stream of the following error:
E0927 09:10:13 7f5be3fff700 api_server.cc:184 Metadata request unsuccessful: Server responded with 'Forbidden' (403): Transport endpoint is not connected
Can anyone tell me exactly what's going on here? Can anyone please explain why one of my images doesn't have this problem (well it gives a few of these messages but gets past them) and the other does have this problem (thousands of this message and taking over 24 hours before I killed it).
If I ssh in to a GCE instance then both versions of the container pull and run just fine. I'm suspecting the INTEGRITY_RULE checking from the logs but I know nothing about how that works.
MORE INFO: this is down to "restart policy: never". Even a simple Centos:7 container that says "hello world", deployed from the console, triggers this if the restart policy is never. At least in the short term I can fix this in the entrypoint script, as the instance will be destroyed when the monitor realises that the process has finished.
I suggest you try creating a 3rd container that's focused on the metadata service functionality to isolate the issue. It may be that there's a timing difference between the 2 containers that's not being overcome.
Make sure you can ‘curl’ the metadata service from the VM and that the request to the metadata service is using the VM's service account.
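The metadata check suggested above can also be scripted. The following is a minimal sketch using only the Python standard library; it assumes the standard GCE metadata endpoint and the Metadata-Flavor: Google header that the metadata server requires. The helper names build_metadata_request and fetch_metadata are made up for illustration.

```python
import urllib.request

# Standard base URL of the GCE metadata server (reachable only from inside a VM).
METADATA_BASE = "http://metadata.google.internal/computeMetadata/v1/"


def build_metadata_request(path: str) -> urllib.request.Request:
    """Build a GET request for one metadata path.

    The metadata server rejects requests that lack the
    "Metadata-Flavor: Google" header.
    """
    return urllib.request.Request(
        METADATA_BASE + path.lstrip("/"),
        headers={"Metadata-Flavor": "Google"},
    )


def fetch_metadata(path: str, timeout: float = 5.0) -> str:
    """Fetch one metadata value as text; raises URLError outside GCE."""
    request = build_metadata_request(path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example, to be run on the VM or inside the container:
#   fetch_metadata("instance/service-accounts/default/email")
# returns the email of the VM's default service account.
```

Running the same call from the VM and from inside each container version would show whether the 403 is specific to one of them.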

How to use the Python Kubernetes client in a way resilient to GKE Kubernetes Master disruptions?

We sometimes use Python scripts to spin up and monitor Kubernetes Pods running on Google Kubernetes Engine using the Official Python client library for kubernetes. We also enable auto-scaling on several of our node pools.
According to this, "Master VM is automatically scaled, upgraded, backed up and secured". The post also seems to indicate that some automatic scaling of the control plane / Master VM occurs when the node count increases from 0-5 to 6+ and potentially at other times when more nodes are added.
It seems like the control plane can go down at times like this, when many nodes have been brought up. Around the time this happens, our Python scripts that monitor pods via the control plane often crash, seemingly unable to find the Kubernetes API / control plane endpoint, triggering some of the following exceptions:
ApiException, urllib3.exceptions.NewConnectionError, urllib3.exceptions.MaxRetryError.
What's the best way to handle this situation? Are there any properties of the autoscaling events that might be helpful?
To clarify, what we're doing with the Python client is reading the status of the pod of interest via read_namespaced_pod in a loop every few minutes, and catching exceptions similar to the provided example (in addition, we've also tried catching exceptions for the underlying urllib calls). We have also added retrying with exponential back-off, but things are unable to recover and fail after a specified maximum number of retries, even if that number is high (e.g. keep retrying for >5 minutes).
One thing we haven't tried is recreating the kubernetes.client.CoreV1Api object on each retry. Would that make much of a difference?
When a nodepool size changes, depending on the size, this can initiate a change in the size of the master. Here are the nodepool sizes mapped with the master sizes. In the case where the nodepool size requires a larger master, automatic scaling of the master is initiated on GCP. During this process, the master will be unavailable for approximately 1-5 minutes. Please note that these events are not available in Stackdriver Logging.
At this point all API calls to the master will fail, including the ones from the Python API client and kubectl. However, after 1-5 minutes the master should be available, and calls from both the client and kubectl should work. I was able to test this by scaling my cluster from 3 nodes to 20 nodes, and for 1-5 minutes the master wasn't available.
I obtained the following errors from the Python API client:
Max retries exceeded with url: /api/v1/pods?watch=False (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at>: Failed to establish a new connection: [Errno 111] Connection refused',))
With kubectl I had:
"Unable to connect to the server: dial tcp"
After 1-5 minutes the master was available and the calls were successful. There was no need to recreate the kubernetes.client.CoreV1Api object, as this is just an API endpoint.
According to your description, your master wasn't accessible after 5 minutes, which signals a potential issue with your master or with the setup of the Python script. To troubleshoot this further, you can check the availability of the master on the side, while your Python script runs, by running any kubectl command.
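To ride out an outage of the length described above, the retry loop needs an overall time budget comfortably longer than 5 minutes, rather than a fixed retry count. Below is a minimal sketch of such a loop with exponential back-off. It is not taken from the client library: the helper name call_with_backoff and all of its parameters are hypothetical, and the 10-minute default deadline is just an assumption chosen to exceed the 1-5 minute window.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    deadline_s: float = 600.0,   # assumed budget: longer than a 1-5 minute outage
    base_delay_s: float = 2.0,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call fn() until it succeeds or the time budget would be exceeded.

    The delay doubles after every failure, capped at max_delay_s. The last
    exception is re-raised once the next wait would pass the deadline.
    """
    start = clock()
    delay = base_delay_s
    while True:
        try:
            return fn()
        except retry_on:
            if clock() - start + delay > deadline_s:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay_s)
```

With the official client, the callable would wrap the existing call, for example lambda: v1.read_namespaced_pod(name, namespace), where v1 is the existing CoreV1Api object, and retry_on would list ApiException together with the urllib3 exceptions named in the question. Capping the delay keeps the script polling at least once a minute, so it notices soon after the master comes back.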

Service Fabric stateful service no longer replicates

FURTHER UPDATE: this error has not occurred since the November update.
EDIT: you may want to read this if your stateful service stops working for no apparent reason. A typical sign, using a WordCount-like app as an example, is that the service deployment reports that one partition is remaining and gives up after 5 tries. The stateless service starts OK. The diagnostics report multiple "Constructed instance of type WordCountService" messages. If you have this, then you may have the same problem I have. No amount of uninstalling the VS/SF/Azure SDKs helps. I now use a VM template with VS/Azure/SF installed and just delete and recreate it each time this error occurs (it is rare but has happened several times). I assume MSFT is aware and is fixing this for the beta.
ORIGINAL:
Summary question: Is there a way to reset Service Fabric completely?
Background: I have a stateful/stateless app service based on the WordCount example. All of a sudden, after deployment, the app no longer replicates the stateful service (1 instance, 2 replicas). The stateless service is deployed OK (one instance, no replicas).
The partition status of the primary partition reports "Partition is below target replica or instance count". The replica status is "InBuild" for the replicas; the primary is OK.
On the primary node, there is a warning: "Replica had multiple failures during open. Error = -2147024894".
I have tried cleaning the cluster, uninstalling/reinstalling service fabric, deleting the SfDevCluster directory entirely etc.
If I copy the exact code to another computer with service fabric installed, it works (and I mean copy/paste the whole solution directory).
I had a similar problem last week, but it caused the host service not to start. I tried uninstalling, reinstalling, cleaning, removing the SDKs, removing Visual Studio, etc. The only thing that fixed it was a reinstall of Windows.