K8s pod affinity & anti-affinity: soft (preferredDuringScheduling) not respected in 1.4?

I am experimenting with K8s 1.4 pod affinity/anti-affinity. I am trying to get K8s to cluster pods of the same service together on the same node as much as possible (i.e. only move to the next node when no more pods fit on the node where the service is already running). To do so, I set up:
A hard (requiredDuringScheduling) anti-affinity to exclude running where a different service is already running (pod_label_xyz not in [value-a])
A soft (preferredDuringScheduling) affinity to try to run where the same service is already running (pod_label_xyz in [value-a]) - weight 100
A soft (preferredDuringScheduling) anti-affinity to try not to run where the same service is not already running (pod_label_xyz not present) - weight 100
With 5 nodes and 3 services (pod_label_xyz set to value-a, value-b, value-c), each with 1 pod created by a replication controller, the first pods get scheduled properly, and when I scale any of them up the first (hard) rule is respected by K8s. However, the second and third rules (the third is actually redundant with the second) are not respected. When I scale up, K8s tries to push pods onto empty nodes (not used by any other service) even though there is capacity to schedule more pods where the service is already running. In fact, if I scale up even further, new pods get created on the original node as well as on the new (previously unused) nodes.
Please advise if I am missing something.
Thank you.
Here is the annotation I used:
scheduler.alpha.kubernetes.io/affinity: >
  {
    "podAffinity": {
      "preferredDuringSchedulingIgnoredDuringExecution": [
        {
          "weight": 100,
          "podAffinityTerm": {
            "labelSelector": {
              "matchExpressions": [
                { "key": "pod_label_xyz", "operator": "Exists" },
                { "key": "pod_label_xyz", "operator": "In", "values": ["value-a"] }
              ]
            },
            "namespaces": ["sspni-882-frj"],
            "topologyKey": "kubernetes.io/hostname"
          }
        }
      ]
    },
    "podAntiAffinity": {
      "requiredDuringSchedulingIgnoredDuringExecution": [
        {
          "labelSelector": {
            "matchExpressions": [
              { "key": "pod_label_xyz", "operator": "Exists" },
              { "key": "pod_label_xyz", "operator": "NotIn", "values": ["value-a"] }
            ]
          },
          "namespaces": ["sspni-882-frj"],
          "topologyKey": "kubernetes.io/hostname"
        }
      ],
      "preferredDuringSchedulingIgnoredDuringExecution": [
        {
          "weight": 100,
          "podAffinityTerm": {
            "labelSelector": {
              "matchExpressions": [
                { "key": "pod_label_xyz", "operator": "DoesNotExist" }
              ]
            },
            "namespaces": ["sspni-882-frj"],
            "topologyKey": "kubernetes.io/hostname"
          }
        }
      ]
    }
  }
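For reference (separate from the 1.4 behaviour discussed above): from Kubernetes 1.6 onward the same rules can be written as a first-class spec.affinity field instead of the alpha annotation. A minimal sketch for the value-a service, where the pod name, labels block and container are placeholders:
{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
    "name": "value-a-pod",
    "labels": { "pod_label_xyz": "value-a" }
  },
  "spec": {
    "affinity": {
      "podAffinity": {
        "preferredDuringSchedulingIgnoredDuringExecution": [
          {
            "weight": 100,
            "podAffinityTerm": {
              "labelSelector": {
                "matchExpressions": [
                  { "key": "pod_label_xyz", "operator": "In", "values": ["value-a"] }
                ]
              },
              "topologyKey": "kubernetes.io/hostname"
            }
          }
        ]
      },
      "podAntiAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": [
          {
            "labelSelector": {
              "matchExpressions": [
                { "key": "pod_label_xyz", "operator": "NotIn", "values": ["value-a"] }
              ]
            },
            "topologyKey": "kubernetes.io/hostname"
          }
        ]
      }
    },
    "containers": [
      { "name": "app", "image": "nginx" }
    ]
  }
}
kubectl accepts JSON manifests as well as YAML, so a spec like this can be applied directly with kubectl create -f.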

Related

OpenSearch: Failed to set number of replicas due to no permissions

I have a problem running an index management policy for new indices. I get the following error on the "set number_of_replicas" step:
{
  "cause": "no permissions for [indices:admin/settings/update] and associated roles [index_management_full_access, own_index, security_rest_api_access]",
  "message": "Failed to set number_of_replicas to 2 [index=sample.name-2022.10.22]"
}
The indices are created by Logstash with the "sample.name-YYYY.MM.DD" name template, so in the index policy I use the "sample.name-*" index pattern.
My policy:
{
  "policy_id": "sample.name-*",
  "description": "sample.name-* policy",
  "schema_version": 16,
  "error_notification": null,
  "default_state": "set replicas",
  "states": [
    {
      "name": "set replicas",
      "actions": [
        {
          "replica_count": {
            "number_of_replicas": 2
          }
        }
      ]
    }
  ],
  "ism_template": [
    {
      "index_patterns": [
        "sample.name-*"
      ],
      "priority": 1
    }
  ]
}
I don't understand the reason for this error.
Am I doing something wrong?
Retrying the policy doesn't work.
The policy works only if I manually reassign it to the index via Dashboards or the API.
Opensearch version: 2.3.0
The first time, I created the policy using the API under a custom internal user with only the "security_rest_api_access" security role mapped.
So I added all_access rights to my internal user and re-created the policy, and now it works!
It seems that the policy runs under the internal user that created it.
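For others hitting this: one way to grant those rights is to map the all_access role to the internal user (via the security API or Dashboards) and then re-create the policy while logged in as that user. A sketch of the request body for PUT _plugins/_security/api/rolesmapping/all_access, where my_internal_user is a placeholder:
{
  "backend_roles": ["admin"],
  "users": ["my_internal_user"]
}
Note that PUT replaces the existing mapping for the role, so include any users or backend roles that are already mapped (the "admin" backend role shown here is just an example).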

CannotPullContainerError: failed to extract layer

I'm trying to run a task with a Windows container on AWS Fargate.
The container is a .NET Framework 4.5 console application.
This is the task definition, generated programmatically via the SDK:
var taskResponse = await ecsClient.RegisterTaskDefinitionAsync(new Amazon.ECS.Model.RegisterTaskDefinitionRequest()
{
    RequiresCompatibilities = new List<string>() { "FARGATE" },
    TaskRoleArn = TASK_ROLE_ARN,
    ExecutionRoleArn = EXECUTION_ROLE_ARN,
    Cpu = CONTAINER_CPU.ToString(),
    Memory = CONTAINER_MEMORY.ToString(),
    NetworkMode = NetworkMode.Awsvpc,
    Family = "netfullframework45consoleapp-task-definition",
    EphemeralStorage = new EphemeralStorage() { SizeInGiB = EPHEMERAL_STORAGE_SIZE_GIB },
    ContainerDefinitions = new List<Amazon.ECS.Model.ContainerDefinition>()
    {
        new Amazon.ECS.Model.ContainerDefinition()
        {
            Name = "netfullframework45consoleapp-task-definition",
            Image = "XXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/netfullframework45consoleapp:latest",
            Cpu = CONTAINER_CPU,
            Memory = CONTAINER_MEMORY,
            Essential = true
            // I REMOVED THE LOG DEFINITION TO SIMPLIFY THE PROBLEM
            //,
            //LogConfiguration = new Amazon.ECS.Model.LogConfiguration()
            //{
            //    LogDriver = LogDriver.Awslogs,
            //    Options = new Dictionary<string, string>()
            //    {
            //        { "awslogs-create-group", "true" },
            //        { "awslogs-group", $"/ecs/{TASK_DEFINITION_NAME}" },
            //        { "awslogs-region", AWS_REGION },
            //        { "awslogs-stream-prefix", $"{TASK_DEFINITION_NAME}" }
            //    }
            //}
        }
    }
});
This is the policy contained in AmazonECSTaskExecutionRolePolicy, which is attached to the task's execution role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
I get this error when launching the task:
CannotPullContainerError: ref pull has been retried 1 time(s): failed to extract layer sha256:fe48cee89971abac42eedb9110b61867659df00fc5b0b90dd91d6e19f704d935: link /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/212/fs/Files/ProgramData/Microsoft/Event Viewer/Views/ServerRoles/RemoteDesktop.Events.xml /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/212/fs/Files/Windows/Microsoft.NET/assembly/GAC_64/Microsoft.Windows.ServerManager.RDSPlugin/v4.0_10.0.0.0__31bf3856ad364e35/RemoteDesktop.Events.xml: no such file or directory: unknown
Some searching led me here:
https://aws.amazon.com/it/premiumsupport/knowledge-center/ecs-pull-container-api-error-ecr/
Point 1 says that if I run the task in a private subnet (as I'm doing) I need a NAT gateway, with the related route, to guarantee communication with ECR. However,
note that my infrastructure has a VPC endpoint for ECR...
So the first question is: is a VPC endpoint sufficient to guarantee communication from the container to the container image registry (ECR)? Or do I necessarily need to implement what point 1 says (a NAT gateway and a route in the route table), or alternatively run the task in a public subnet?
Could the error be related to missing communication with ECR, or could it be a missing-policy problem?
Make sure your VPC endpoints are configured correctly. Note that
"Amazon ECS tasks hosted on Fargate using platform version 1.4.0 or later require both the com.amazonaws.region.ecr.dkr and com.amazonaws.region.ecr.api Amazon ECR VPC endpoints as well as the Amazon S3 gateway endpoint to take advantage of this feature."
See https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html for more information.
The first paragraph of the page I linked also notes: "You don't need an internet gateway, a NAT device, or a virtual private gateway."
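To illustrate, the three endpoints from the quote can be declared alongside the task's VPC; a CloudFormation-style sketch (these would sit under the template's Resources section), where TaskVpc, PrivateSubnet, EndpointSecurityGroup and PrivateRouteTable are placeholder references and eu-west-1 matches the ECR repository in the question:
{
  "EcrApiEndpoint": {
    "Type": "AWS::EC2::VPCEndpoint",
    "Properties": {
      "VpcId": { "Ref": "TaskVpc" },
      "ServiceName": "com.amazonaws.eu-west-1.ecr.api",
      "VpcEndpointType": "Interface",
      "SubnetIds": [{ "Ref": "PrivateSubnet" }],
      "SecurityGroupIds": [{ "Ref": "EndpointSecurityGroup" }],
      "PrivateDnsEnabled": true
    }
  },
  "EcrDkrEndpoint": {
    "Type": "AWS::EC2::VPCEndpoint",
    "Properties": {
      "VpcId": { "Ref": "TaskVpc" },
      "ServiceName": "com.amazonaws.eu-west-1.ecr.dkr",
      "VpcEndpointType": "Interface",
      "SubnetIds": [{ "Ref": "PrivateSubnet" }],
      "SecurityGroupIds": [{ "Ref": "EndpointSecurityGroup" }],
      "PrivateDnsEnabled": true
    }
  },
  "S3GatewayEndpoint": {
    "Type": "AWS::EC2::VPCEndpoint",
    "Properties": {
      "VpcId": { "Ref": "TaskVpc" },
      "ServiceName": "com.amazonaws.eu-west-1.s3",
      "VpcEndpointType": "Gateway",
      "RouteTableIds": [{ "Ref": "PrivateRouteTable" }]
    }
  }
}
The interface endpoints' security group must allow inbound HTTPS (443) from the Fargate tasks, and if you re-enable the awslogs driver you will also need a com.amazonaws.eu-west-1.logs endpoint.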

Pod identity on AKS cluster creation

Right now it seems impossible to assign user-assigned identities in ARM templates (or Terraform) at cluster creation. I have already tried a lot of things, and updates work great after inserting the identity manually with:
az aks pod-identity add --cluster-name my-aks-cn --resource-group myrg --namespace myns --name example-pod-identity --identity-resource-id /subscriptions/......
But I want this done in one go, with the deployment, so I need the pod user identities inserted into the cluster automatically. I also tried running the command using deploymentScripts, but deployment scripts are not ready to use the preview AKS extension.
My config looks like this:
{
  "type": "Microsoft.ContainerService/managedClusters",
  "apiVersion": "2021-02-01",
  "name": "[variables('cluster_name')]",
  "location": "[variables('location')]",
  "dependsOn": [
    "[resourceId('Microsoft.Network/virtualNetworks', variables('vnet_name'))]"
  ],
  "properties": {
    ....
    "podIdentityProfile": {
      "allowNetworkPluginKubenet": null,
      "enabled": true,
      "userAssignedIdentities": [
        {
          "identity": {
            "clientId": "[reference(resourceId('Microsoft.ManagedIdentity/userAssignedIdentities', 'managed-indentity'), '2018-11-30').clientId]",
            "objectId": "[reference(resourceId('Microsoft.ManagedIdentity/userAssignedIdentities', 'managed-indentity'), '2018-11-30').principalId]",
            "resourceId": "[resourceId('Microsoft.ManagedIdentity/userAssignedIdentities', 'managed-indentity')]"
          },
          "name": "managed-indentity",
          "namespace": "myns"
        }
      ],
      "userAssignedIdentityExceptions": null
    },
    ....
  },
  "identity": {
    "type": "SystemAssigned"
  }
},
I'm always getting the same issue:
"statusMessage": "{\"error\":{\"code\":\"InvalidTemplateDeployment\",\"message\":\"The template deployment 'deployment_test' is not valid according to the validation procedure. The tracking id is '.....'. See inner errors for details.\",\"details\":[{\"code\":\"PodIdentityAddonUserAssignedIdentitiesNotAllowedInCreation\",\"message\":\"Provisioning of resource(s) for container service cluster-12344 in resource group myrc failed. Message: {\\n \\\"code\\\": \\\"PodIdentityAddonUserAssignedIdentitiesNotAllowedInCreation\\\",\\n \\\"message\\\": \\\"PodIdentity addon does not support assigning pod identities on creation.\\\"\\n }. Details: \"}]}}",
The Product team has shared the answer here: https://github.com/Azure/aad-pod-identity/issues/1123
which says:
This is a known limitation in the existing configuration. We will fix
this in the V2 implementation.
For others who are facing the same issue, please refer to the GitHub issue above.
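Until that lands, a workaround consistent with the error is to enable the addon at creation but leave userAssignedIdentities out of the template, then add the identities after deployment with the az aks pod-identity add command shown in the question. A minimal sketch of the creation-time fragment, assuming the addon itself may be enabled at creation (only the podIdentityProfile block is shown):
"podIdentityProfile": {
  "allowNetworkPluginKubenet": null,
  "enabled": true
}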

Grafana 7.4.3 /var/lib/grafana not writeable in AWS ECS - EFS

I'm trying to host a Grafana 7.4 image in ECS Fargate using an EFS volume for persistent storage.
Using Terraform I have created the required resources and given the task access to the EFS volume via an "access point":
resource "aws_efs_access_point" "default" {
file_system_id = aws_efs_file_system.default.id
posix_user {
gid = 0
uid = 472
}
root_directory {
path = "/opt/data"
creation_info {
owner_gid = 0
owner_uid = 472
permissions = "600"
}
}
}
Note that I have set owner permissions as per the guides in https://grafana.com/docs/grafana/latest/installation/docker/#migrate-from-previous-docker-containers-versions (I've tried both group id 0 and 1 as the documentation seems to be inconsistent on the gid).
Using a base Alpine image in place of the Grafana image, I've confirmed the directory /var/lib/grafana exists within the container with the correct uid and gid set. However, on attempting to run the Grafana image I get the error message:
GF_PATHS_DATA='/var/lib/grafana' is not writable.
I am launching the task with a Terraformed task definition:
resource "aws_ecs_task_definition" "default" {
family = "${var.name}"
container_definitions = "${data.template_file.container_definition.rendered}"
memory = "${var.memory}"
cpu = "${var.cpu}"
requires_compatibilities = [
"FARGATE"
]
network_mode = "awsvpc"
execution_role_arn = "arn:aws:iam::REDACTED_ID:role/ecsTaskExecutionRole"
volume {
name = "${var.name}-volume"
efs_volume_configuration {
file_system_id = aws_efs_file_system.default.id
transit_encryption = "ENABLED"
root_directory = "/opt/data"
authorization_config {
access_point_id = aws_efs_access_point.default.id
}
}
}
tags = {
Product = "${var.name}"
}
}
With the container definition
[
  {
    "portMappings": [
      {
        "hostPort": 80,
        "protocol": "tcp",
        "containerPort": 80
      }
    ],
    "mountPoints": [
      {
        "sourceVolume": "${volume_name}",
        "containerPath": "/var/lib/grafana",
        "readOnly": false
      }
    ],
    "cpu": 0,
    "secrets": [
      ...
    ],
    "environment": [],
    "image": "grafana/grafana:7.4.3",
    "name": "${name}",
    "user": "472:0"
  }
]
For "user" I have tried "grafana", "742:0", "742" and "742:1" when trying gid 1.
I believe the Terraform, security groups, mount targets, etc. are all correct, as from an Alpine image I can run:
ls -lash /var/lib
> drw------- 2 472 root 6.0K Mar 12 11:22 grafana
I believe you are hitting a known AWS ECS issue: https://github.com/aws/containers-roadmap/issues/938
Anyway, the file-system approach doesn't seem very cloud friendly (especially if you want to scale horizontally: problems with concurrent writes from multiple tasks, IOPS limitations, ...). Just provision a proper DB (e.g. Aurora RDS MySQL, a multi-AZ cluster if you need HA) and you will have a nice ops-less AWS deployment.
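If you go the database route, Grafana can be pointed at an external database purely through environment variables in the container definition. A minimal sketch; the host and credentials are placeholders, and in practice the password belongs in "secrets" rather than "environment":
"environment": [
  { "name": "GF_DATABASE_TYPE", "value": "mysql" },
  { "name": "GF_DATABASE_HOST", "value": "my-aurora-cluster.cluster-xxxx.eu-west-1.rds.amazonaws.com:3306" },
  { "name": "GF_DATABASE_NAME", "value": "grafana" },
  { "name": "GF_DATABASE_USER", "value": "grafana" },
  { "name": "GF_DATABASE_PASSWORD", "value": "change-me" }
]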

Azure Service Fabric node type instance count doubled when creating a cluster using ARM

I'm experimenting with creating a new Service Fabric cluster using an ARM template, modifying the template to add certificates, etc. The cluster and all resources are created successfully, but I noticed that initially the number of node instances is 2x plus 1 of what I set. For example, if I set "vmInstanceCount" to 3, I see 7 instances being created.
If I just wait and let them finish, 4 instances are deleted and the three instances are kept. One problem is that it randomly selects which ones to keep, so the surviving names can be node_1, node_4, node_6, which is messy.
Here's my nodeType snippet:
"nodeTypes": [
{
"name": "[variables('vmNodeType0Name')]",
"applicationPorts": {
"endPort": 30000,
"startPort": 20000
},
"clientConnectionEndpointPort": "[variables('fabricTcpGatewayPort')]",
"ephemeralPorts": {
"endPort": 65534,
"startPort": 49152
},
"httpGatewayEndpointPort": "[variables('fabricHttpGatewayPort')]",
"isPrimary": true,
"vmInstanceCount": "[variables('vmInstanceCount')]",
"reverseProxyEndpointPort": "[variables('reverseProxyEndpointPort')]",
"durabilityLevel": "Bronze"
}
]
...
"sku": {
"name": "[variables('vmssSkuName')]",
"capacity": "[variables('vmssSkuCapacity')]",
"tier": "Standard"
}
I was talking to Microsoft support earlier, and this behaviour is actually a feature (overprovisioning), as described here: https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-design-overview#overprovisioning
I will close this question as I found the answer. However, I still have some concerns about the naming part, but I will raise that question with MS.
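If the surge instances (and the resulting node names) are a concern, overprovisioning can be disabled on the scale set itself; a minimal sketch of the relevant fragment of the Microsoft.Compute/virtualMachineScaleSets resource, with everything else omitted:
{
  "type": "Microsoft.Compute/virtualMachineScaleSets",
  "apiVersion": "2019-07-01",
  "name": "[variables('vmNodeType0Name')]",
  "sku": {
    "name": "[variables('vmssSkuName')]",
    "capacity": "[variables('vmssSkuCapacity')]",
    "tier": "Standard"
  },
  "properties": {
    "overprovision": false
  }
}
Whether disabling overprovisioning is appropriate for a Service Fabric node type is worth checking against the Service Fabric guidance for your durability level.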