How to test the validity of alertmanager.yaml - kubernetes

Is there a way to find out why my Alertmanager configuration is not being applied?
From the docs, the reason for that is that the file is not valid:
Alertmanager can reload its configuration at runtime. If the new configuration is not well-formed, the changes will not be applied and an error is logged. A configuration reload is triggered by sending a SIGHUP to the process or sending an HTTP POST request to the /-/reload endpoint.
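(For reference, the reload can be triggered like this, assuming Alertmanager listens on its default port 9093:

curl -X POST http://localhost:9093/-/reload

or by sending SIGHUP to the running process.)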
I am trying to find a way to test the validity of my alertmanager.yaml.
I came across the amtool GitHub repo, which walks you through installing Alertmanager itself; amtool is included in the release package.
OK, I did that, and I got amtool.exe inside the downloaded Alertmanager package.
I added my file to the config folder that is supposed to be scanned by the tool, but I got no answer from it; the console window keeps closing without showing any log.
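For what it's worth, amtool is a command-line tool: double-clicking amtool.exe opens a console that closes as soon as the process exits, which is why no log is visible. It has to be run from a terminal with the config file passed as an argument; a minimal check (the file path here is an assumption) looks like:

amtool check-config alertmanager.yaml

If the file is valid it prints roughly what it found (receivers, templates, etc.); otherwise it prints the parse error and exits non-zero.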
Second,
I have installed the Prometheus stack Helm chart on my k8s cluster from the prometheus-community GitHub repo; how do I find amtool inside it?
No code sample; it is not a code issue.
Thanks everyone.
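For anyone who lands here with the same Helm setup: amtool ships inside the Alertmanager container image, so it can be invoked in-cluster along these lines (the pod name and mount path are assumptions; they depend on your release name and chart version):

kubectl exec -it alertmanager-<release>-alertmanager-0 -- amtool check-config /etc/alertmanager/config/alertmanager.yaml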

Related

`gcloud run deploy` raises "Revision <revision_name> is not ready and cannot serve traffic."

Command
gcloud run deploy api --region=$REGION --image=$IMAGE
Logs
Deploying container to Cloud Run service [api] in project [[MASKED]] region [[MASKED]]
Deploying...
Creating Revision...........interrupted
Deployment failed
ERROR: (gcloud.run.deploy) Revision [[MASKED]] is not ready and cannot serve traffic.
I've tried searching the Google Cloud documentation, but it does not mention such a problem.
How do I solve "Revision is not ready and cannot serve traffic."?
Try waiting a few minutes and then just re-launch the procedure. The good old "let's retry without changing anything" worked for me! :)
EDIT: I talked with a Cloud Architect who works with me, and he told me that this is the actual solution: if you retry the deploy too quickly, GCP may still have some pending operations from the previous one!
I faced the same error in Cloud Run after getting the container working correctly locally. In my case the revisions weren't showing as failing; they had a grey checkmark,
and when hovering I got the message
The revision is healthy but not currently serving traffic.
I just needed to click Manage Traffic and route 100% of the traffic to the new revision.
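If you prefer the CLI over the console, the rough equivalent is (service name and region are assumptions):

gcloud run services update-traffic api --to-latest --region=$REGION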
I faced this problem as well. In my case I checked the "Cloud Run" section from the hamburger menu of the Google Cloud console. The "Logs" section should give you more of an idea about what went wrong. I was missing a Python library, and adding the correct Python dependency to my requirements.txt solved the issue for me. Somehow my local testing went fine without hitting this issue. I hope this helps. :)
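The same logs can also be pulled from the CLI; a sketch, assuming the service is named api:

gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="api"' --limit=50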
I faced this problem too. In my case the Docker image was missing a required dependency package at build stage, because my Dockerfile missed some steps to copy the files required to install the package.
To find your problem, if the Cloud Build logs do not make sense to you, I think you should (a command-line sketch follows the list):
From the Cloud console, go to the service "Container Registry" > Images
Select your repository name
From the image version (maybe latest) that you want to check > more actions > show pull command > then copy that command, e.g. docker pull gcr.io/..
From the Cloud console header > select Activate Cloud Shell
At the Cloud Shell terminal, pull the Docker image of your latest build by running the "pull command" you copied before.
Start a container from this image to see what exactly happens with your Cloud Run revision
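In other words, roughly this (image name, tag, and port are assumptions; Cloud Run tells the container which port to listen on via the PORT environment variable, 8080 by default):

docker pull gcr.io/<project-id>/api:latest
docker run --rm -p 8080:8080 -e PORT=8080 gcr.io/<project-id>/api:latest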

Dags in Airflow UI do not show on/off and keep on loading

I installed Airflow on Kubernetes and can log in to the Airflow UI. It shows all DAGs, but they are not shown correctly.
1/ There are no on/off buttons to the left of the DAG names; it just shows an empty checkbox.
2/ The "Recent Tasks" and "DAG Runs" columns look like they are stuck trying to load something;
3/ If I click through to any DAG, it also looks like it is stuck trying to load something;
I tried both Airflow 2.0.0 and 1.10.11 and they behave the same, so it is not a version issue.
What is the problem with Airflow and how do I fix it?
------- Here I provide more information according to Ofek Hod's suggestion:
1/ Running "kubectl logs <pod_id> webserver": after I log in to the Airflow web UI, I get many HTTP 404 responses, e.g.
after I click any DAG in the Airflow web UI I get some other 404 responses
Airflow first parses all the .py files in your dags folder; I guess something goes wrong there.
As a rule of thumb, first check the webserver and scheduler logs (for Kubernetes, kubectl logs); maybe you can find a hint there.
If not, first try to make a "clean" Airflow instance without any of your DAG code or related .py files: point the dags folder to an empty directory and see what happens (better if you turn on the example DAGs configuration), as sketched below.
If that works, add your .py files from the original dags folder back incrementally until you find the problematic code.
If it's not working, the scheduler or webserver is probably messed up; please check the logs again more carefully.
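A sketch of the "clean instance" test, assuming configuration via airflow.cfg (the directory path is a placeholder):

[core]
dags_folder = /tmp/empty-dags
load_examples = True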
I found the answer myself:
The package I use to set up Airflow on k8s has a step that runs ./airflow/www/compile_assets.sh using npm, but it missed the step that installs npm. So I added "apt install -y npm" to the step, and now the Airflow page renders correctly.
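In other words, roughly this, assuming a Debian-based image:

apt-get update && apt-get install -y npm
./airflow/www/compile_assets.sh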

Has anyone tried the HLF 2.0 feature "External Builders and Launchers" and wants to get in touch?

I'm working my way through the HLF 2.0 docs and would love to discuss and try out the new features "External Builders and Launchers" and "Chaincode as an external service".
My goal is to run HLF 2.0 on a K8s cluster (OpenShift). Does anyone want to get in touch, or has anyone already figured their way through?
Cheers from Germany
Also trying to use the external builders. I set up core.yaml and rebuilt the containers to use it. On "peer lifecycle chaincode install .tgz..." I get an error that the path to the scripts configured in core.yaml cannot be found.
I've added volume bind commands in peer-base.yaml and in docker-compose-cli.yaml, and I'm using the first-network setup. I dropped the part of byfn.sh that would connect to the cli container so that I can do that part manually; the channel create, join, and update-anchors steps succeed, and then the install fails.
Specifically, the install fails on /bin/detect, because the peer can't find that file to fork/exec it. To get that far, the peer was able to read my external builder configuration and the core.yaml file. At the moment I'm trying "mode: dev" in core.yaml, which seems to indicate that the scripts and the chaincode are run "locally", which I think means inside the cli container. Otherwise, I've tried to walk the code to see how the Docker containers are created dynamically, and from what image, but I haven't been able to nail that down yet.
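For comparison, a minimal external builder section in core.yaml looks roughly like this (the name and path are assumptions; the path must exist inside the peer container, which is why the volume mounts matter):

chaincode:
  externalBuilders:
    - name: my-builder
      # this directory must be visible to the peer and contain the
      # bin/detect, bin/build and bin/release scripts
      path: /opt/hyperledger/ccbuilders/my-builder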

Azure Function - Publishing Failed - RequestTimeout

I have a basic Azure Functions app. When I try to publish the app, I receive an error that says "error : The attempt to publish the ZIP file through https://... failed with HTTP status code RequestTimeout.".
This app is a .NET Standard app. I followed the instructions here. The difference is that my app has an Event Hub trigger instead of the HTTP trigger shown in the documentation. I don't understand why I'm getting a timeout during deployment, and I also don't know how to get past this.
What am I doing wrong?
Update
Here are the logs.
1>------ Build started: Project: MyProject.Functions, Configuration: Release Any CPU ------
1>MyProject.Functions -> C:\MyProject\MyProject.Functions\bin\Release\netcoreapp2.1\bin\MyProject.Functions.dll
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========
Publish Started
MyProject.Functions -> C:\MyProject\MyProject.Functions\bin\Release\netcoreapp2.1\bin\MyProject.Functions.dll
MyProject.Functions -> C:\MyProject\MyProject.Functions\obj\Release\netcoreapp2.1\PubTmp\Out\
Publishing C:\MyProject\MyProject.Functions\obj\Release\netcoreapp2.1\PubTmp\MyProject.Functions - 20181101105531356.zip to https://my-project.scm.azurewebsites.net/api/zipdeploy...
C:\Users\me\.nuget\packages\microsoft.net.sdk.functions\1.0.23\build\netstandard1.0\Microsoft.NET.Sdk.Functions.Publish.ZipDeploy.targets(42,5): error : The attempt to publish the ZIP file through https://my-project.scm.azurewebsites.net/api/zipdeploy failed with HTTP status code RequestTimeout. [C:\MyProject\MyProject.Functions\MyProject.Functions.csproj]
According to this:
https://github.com/projectkudu/kudu/wiki/Deploying-from-a-zip-file
you should be able to pass ?isAsync=true to the zipdeploy URL (so it would be: https://my-project.scm.azurewebsites.net/api/zipdeploy?isAsync=true).
This request resolves faster without a timeout, and then you can grab the Location header from the response, which you can poll to see the status of your deployment.
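A sketch of the async deploy with curl, using the Kudu deployment credentials (the user, zip path, and host are assumptions):

curl -i -u '<deployment-user>:<password>' -X POST --data-binary @publish.zip "https://my-project.scm.azurewebsites.net/api/zipdeploy?isAsync=true"
# then poll the URL returned in the Location response header until the deployment finishes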
In my case this error was because of the versions of the packages in my .csproj file. After updating them there was no error and the publish was successful.
I faced this recently and spent 2 complete days trying to fix it. Tried most of the solutions suggested here and on other posts.
What finally worked for me was removing my publish settings and creating new ones by uploading a brand new .PublishSettings file.
How to get .PublishSettings file?
On the Azure Portal, on your Function App, click on "Get publish profile",
and it will automatically start downloading.
How to upload the publish profile?
When trying to publish the project from Visual Studio, click on New -> select "Import Profile",
and browse to your .PublishSettings file.
Then, just select this new profile (if it's not selected already), and click on Publish button as you would usually do.
In my case, it was an issue with two things:
1] Visual Studio and Azure are flaky. Timeouts in a working scenario are still somewhat regular, on a bad day happening about 50-75% of the time for me. This is with an 80 MB function app, not super big, and I have gigabit Internet.
2] Someone deleted the file share for the storage account. I had to fix WEBSITE_CONTENTAZUREFILECONNECTIONSTRING to point to the right storage connection string, and I had to update WEBSITE_CONTENTSHARE to point to a valid file share name, which I had to create in the storage account matching the WEBSITE_CONTENTAZUREFILECONNECTIONSTRING connection string.
If you are using development and production function slots, I would suggest making WEBSITE_CONTENTAZUREFILECONNECTIONSTRING and WEBSITE_CONTENTSHARE deployment slot settings; that way you can link to separate production and development storage environments. This is especially handy if you are using Table or Blob storage and don't want to have to prefix or suffix all your table names or keys. In my opinion these two settings should be slot settings by default (see the sketch below).
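A sketch of marking both as slot-sticky settings with the CLI (resource group, app name, and values are placeholders):

az functionapp config appsettings set -g <resource-group> -n <function-app> \
  --slot-settings WEBSITE_CONTENTSHARE=<share-name> WEBSITE_CONTENTAZUREFILECONNECTIONSTRING=<connection-string>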
Once I made these changes I could publish, though I'm still dealing with the intermittent timeouts.
The error messaging with Azure Functions publishing is bad to non-existent, with any kind of configuration or resource error simply surfacing as a timeout.
I got the same issue when using Visual Studio. Very frustrating.
But then I just used the zip file that VS created and used
az functionapp deployment source config-zip \
  -g <resource_group> -n <app_name> --src <zip_file_path>
to publish.
You can find more options in
https://learn.microsoft.com/en-us/azure/azure-functions/deployment-zip-push
I got the same issue recently.
I'm not sure if they are related, but it started working fine after updating the NuGet package "Microsoft.NET.Sdk.Functions" to v3.0.7.
Changing the profile to use Web Deploy was the only way I could update my Azure Function.
When downloading the profiles from the Azure Portal and importing them into VS, I noticed it imported two profiles: one for Zip deploy, and another for the Web Deploy upload method.
The Zip publish profile failed, but the second, Web Deploy profile worked and updated perfectly.

Error occurred while starting the build in Openshift 3

I have been trying to deploy a WAR file as an OpenShift project. The server used is jboss-webserver30-tomcat8. I have followed the steps below -
Put the ROOT.war file under the 'deployments' directory in the local system.
Push the changes to GitHub.
Create a new Java project in OpenShift 3 and provide the GitHub repository details.
No automatic build or deployment starts. On manually clicking the Start Build button, the below error is displayed:
An error occurred while starting the build. Reason: Error resolving
ImageStreamTag jboss-webserver30-tomcat8-openshift:1.2 in namespace
openshift: unable to find latest tagged image
Please suggest how I can resolve the error.
This is an issue with how the jboss-webserver30-tomcat8-openshift image stream is defined in the cluster. We are working to correct this; it is not currently importing the correct set of tags, and as a result the 1.2 tag stopped being a valid tag, when it should be one.
However, the short-term solution is to change your BuildConfig to reference one of the tags that has a valid image reference associated (e.g. 1.3) instead of the 1.2 tag it is currently referencing, as sketched below. Your build should then be able to run.
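For example, with the CLI (the BuildConfig name is a placeholder; this assumes a source-to-image build strategy):

oc patch bc/<your-buildconfig> --type=json \
  -p '[{"op":"replace","path":"/spec/strategy/sourceStrategy/from/name","value":"jboss-webserver30-tomcat8-openshift:1.3"}]'
oc start-build <your-buildconfig>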
A (temporarily) unavailable builder image may be related to this platform upgrade, which correlates with the time of posting your question.
Generally, the best place to check for any incident reports or scheduled maintenance is the Status Page (Starter | Pro clusters; it's linked in the web console too, in the upper right corner of the interface).
If this does not seem to be related (e.g. you're not on the starter-us-west-2 cluster where the platform upgrade is taking place), or it persists after the maintenance is over, I would encourage you to check the open issues, and log a new bug report if it's not in the list.
Thank you.