CannotPullContainerError: failed to extract layer - amazon-ecs

I'm trying to run a task on a windows container in fargate mode on aws
The container is a .net console application (Fullframework 4.5)
This is the task definition generated programmatically by SDK
var taskResponse = await ecsClient.RegisterTaskDefinitionAsync(new Amazon.ECS.Model.RegisterTaskDefinitionRequest()
{
RequiresCompatibilities = new List<string>() { "FARGATE" },
TaskRoleArn = TASK_ROLE_ARN,
ExecutionRoleArn = EXECUTION_ROLE_ARN,
Cpu = CONTAINER_CPU.ToString(),
Memory = CONTAINER_MEMORY.ToString(),
NetworkMode = NetworkMode.Awsvpc,
Family = "netfullframework45consoleapp-task-definition",
EphemeralStorage = new EphemeralStorage() { SizeInGiB = EPHEMERAL_STORAGE_SIZE_GIB },
ContainerDefinitions = new List<Amazon.ECS.Model.ContainerDefinition>()
{
new Amazon.ECS.Model.ContainerDefinition()
{
Name = "netfullframework45consoleapp-task-definition",
Image = "XXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/netfullframework45consoleapp:latest",
Cpu = CONTAINER_CPU,
Memory = CONTAINER_MEMORY,
Essential = true
//I REMOVED THE LOG DEFINITION TO SIMPLIFY THE PROBLEM
//,
//LogConfiguration = new Amazon.ECS.Model.LogConfiguration()
//{
// LogDriver = LogDriver.Awslogs,
// Options = new Dictionary<string, string>()
// {
// { "awslogs-create-group", "true"},
// { "awslogs-group", $"/ecs/{TASK_DEFINITION_NAME}" },
// { "awslogs-region", AWS_REGION },
// { "awslogs-stream-prefix", $"{TASK_DEFINITION_NAME}" }
// }
//}
}
}
});
these are the role policies contained used by the task AmazonECSTaskExecutionRolePolicy
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "*"
}
]
}
i got this error when lunch the task
CannotPullContainerError: ref pull has been retried 1 time(s): failed to extract layer sha256:fe48cee89971abac42eedb9110b61867659df00fc5b0b90dd91d6e19f704d935: link /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/212/fs/Files/ProgramData/Microsoft/Event Viewer/Views/ServerRoles/RemoteDesktop.Events.xml /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/212/fs/Files/Windows/Microsoft.NET/assembly/GAC_64/Microsoft.Windows.ServerManager.RDSPlugin/v4.0_10.0.0.0__31bf3856ad364e35/RemoteDesktop.Events.xml: no such file or directory: unknown
some search drived me here:
https://aws.amazon.com/it/premiumsupport/knowledge-center/ecs-pull-container-api-error-ecr/
the point 1 says that if i run the task on the private subnet (like i'm doing) i need a NAT with related route to garantee the communication towards the ECR, but
note that in my infrastructure i've a VPC Endpoint to the ECR....
so the first question is: is a VPC Endpoint sufficent to garantee the comunication from the container to the container images registry(ECR)? or i need necessarily to implement what the point 1 say (NAT and route on the route table) or eventually run the task on a public subnet?
Can be the error related to the missing communication towards the ECR, or could be a missing policy problem?

Make sure your VPC endpoint is configured correctly. Note that
"Amazon ECS tasks hosted on Fargate using platform version 1.4.0 or later require both the com.amazonaws.region.ecr.dkr and com.amazonaws.region.ecr.api Amazon ECR VPC endpoints as well as the Amazon S3 gateway endpoint to take advantage of this feature."
See https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html for more information
In the first paragraph of the page I linked: "You don't need an internet gateway, a NAT device, or a virtual private gateway."

Related

GCP: using image from one account's Artifact Registry on other account

Hello and wish you a great time!
I've got following terraform service account deffinition:
resource "google_service_account" "gke_service_account" {
project = var.context
account_id = var.gke_account_name
display_name = var.gke_account_description
}
That I use in GCP kubernetes node pool:
resource "google_container_node_pool" "gke_node_pool" {
name = "${var.context}-gke-node"
location = var.region
project = var.context
cluster = google_container_cluster.gke_cluster.name
management {
auto_repair = "true"
auto_upgrade = "true"
}
autoscaling {
min_node_count = var.gke_min_node_count
max_node_count = var.gke_max_node_count
}
initial_node_count = var.gke_min_node_count
node_config {
machine_type = var.gke_machine_type
service_account = google_service_account.gke_service_account.email
metadata = {
disable-legacy-endpoints = "true"
}
# Needed for correctly functioning cluster, see
# https://www.terraform.io/docs/providers/google/r/container_cluster.html#oauth_scopes
oauth_scopes = [
"https://www.googleapis.com/auth/logging.write",
"https://www.googleapis.com/auth/monitoring",
"https://www.googleapis.com/auth/devstorage.read_only",
"https://www.googleapis.com/auth/servicecontrol",
"https://www.googleapis.com/auth/cloud-platform"
]
}
}
However the current solution requires the prod and dev envs to be on various GCP accounts but use the same image from prod artifact registry.
As for now I have JSON key file for service account in prod having access to it's registry. Maybe there's a pretty way to use the json file as a second service account for kubernetes or update current k8s service account with json file to have additional permissions to the remote registry?
I've seen the solutions like put it to a secret or user cross-account-service-account.
But it's not the way I want to resolve it since we have some internal restrictions.
Hope someone faced similar task and has a solution to share - it'll save me real time.
Thanks in advance!

Can somene help me how to give the specific service principal name for azure custom policy in terraform code

The Scenario is:We need to create resources in azure portal through terraform cloud only by using service principals and azure portal shouldn't allow creation of resources by other means.
(No manual creation in azure portal,and also by SVM's tools like vs code)
The following code which i tried:
resource "azurerm_policy_definition" "policych" {
name = "PolicyTestSP"
policy_type = "Custom"
mode = "Indexed"
display_name = "Allow thru SP"
metadata = <<METADATA
{
"category": "General"
}
METADATA
policy_rule = <<POLICY_RULE
{
"if": {
"not": {
"field": "type",
"equals": "https://app.terraform.io/"
}
},
"then": {
"effect": "deny"
}
}
POLICY_RULE
}
resource "azurerm_policy_assignment" "policych" {
name = "Required servive-principles"
display_name = "Allow thru SP"
description = " Require SP Authentication'"
policy_definition_id = "${azurerm_policy_definition.policych.id}"
scope = "/subscriptions/xxxx-xxxx-xxxx-xxxx"
}

Grafana 7.4.3 /var/lib/grafana not writeable in AWS ECS - EFS

I'm trying to host a Grafana 7.4 image in ECS Fargate using an EFS volume for persistent storage.
Using Terraform I have created the required resource and given the task access to the EFS volume via an "access point"
resource "aws_efs_access_point" "default" {
file_system_id = aws_efs_file_system.default.id
posix_user {
gid = 0
uid = 472
}
root_directory {
path = "/opt/data"
creation_info {
owner_gid = 0
owner_uid = 472
permissions = "600"
}
}
}
Note that I have set owner permissions as per the guides in https://grafana.com/docs/grafana/latest/installation/docker/#migrate-from-previous-docker-containers-versions (I've tried both group id 0 and 1 as the documentation seems to be inconsistent on the gid).
Using a base alpine image in place of the grafana image I've confirmed the directory /var/lib/grafana exists within container with the correct uid and gids set. However on attempting to run the grafana image I get the error message
GF_PATHS_DATA='/var/lib/grafana' is not writable.
I am launching the task with a terraformed task definition.
resource "aws_ecs_task_definition" "default" {
family = "${var.name}"
container_definitions = "${data.template_file.container_definition.rendered}"
memory = "${var.memory}"
cpu = "${var.cpu}"
requires_compatibilities = [
"FARGATE"
]
network_mode = "awsvpc"
execution_role_arn = "arn:aws:iam::REDACTED_ID:role/ecsTaskExecutionRole"
volume {
name = "${var.name}-volume"
efs_volume_configuration {
file_system_id = aws_efs_file_system.default.id
transit_encryption = "ENABLED"
root_directory = "/opt/data"
authorization_config {
access_point_id = aws_efs_access_point.default.id
}
}
}
tags = {
Product = "${var.name}"
}
}
With the container definition
[
{
"portMappings": [
{
"hostPort": 80,
"protocol": "tcp",
"containerPort": 80
}
],
"mountPoints": [
{
"sourceVolume": "${volume_name}",
"containerPath": "/var/lib/grafana",
"readOnly": false
}
],
"cpu": 0,
"secrets": [
...
],
"environment": [],
"image": "grafana/grafana:7.4.3",
"name": "${name}",
"user": "472:0"
}
]
For "user" I have tried "grafana", "742:0", "742" and "742:1" when trying gid 1.
I believe the terraform, security groups, mount_targets, etc... are all correct as I can get an alpine image to:
ls -lash /var/lib
> drw------- 2 472 root 6.0K Mar 12 11:22 grafana
I believe you have a problem, because AWS ECS https://github.com/aws/containers-roadmap/issues/938
Anyway, file system approach doesn't seems to be very cloud friendly (especially if you want to scale horizontally: problems with concurrent writes from multiple tasks, IOPs limitations, ...). Just provision proper DB (e.g. Aurora RDS Mysql, multi A-Z cluster if you need HA) and you will have nice opsless AWS deployment.

How to set aws cloudwatch retention via Terraform

Using Terraform to deploy API Gateway/Lambda and already have the appropriate logs in Cloudwatch. However I can't seem to find a way to set the retention on the logs via Terraform, using my currently deployed resources (below). It looks like the log group resource is where I'd do it, but not sure how to point log stream from api gateway at the new log group. I must be missing something obvious ... any advice is very much appreciated!
resource "aws_api_gateway_account" "name" {
cloudwatch_role_arn = "${aws_iam_role.cloudwatch.arn}"
}
resource "aws_iam_role" "cloudwatch" {
name = "#{name}_APIGatewayCloudWatchLogs"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "",
"Effect": "Allow",
"Principal": {
"Service": "apigateway.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
EOF
}
resource "aws_iam_policy_attachment" "api_gateway_logs" {
name = "#{name}_api_gateway_logs_policy_attach"
roles = ["${aws_iam_role.cloudwatch.id}"]
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonAPIGatewayPushToCloudWatchLogs"
}
resource "aws_api_gateway_method_settings" "name" {
rest_api_id = "${aws_api_gateway_rest_api.name.id}"
stage_name = "${aws_api_gateway_stage.name.stage_name}"
method_path = "${aws_api_gateway_resource.name.path_part}/${aws_api_gateway_method.name.http_method}"
settings {
metrics_enabled = true
logging_level = "INFO"
data_trace_enabled = true
}
}
yes, you can use the Lambda log name to create log resource before you create the Lambda function. Or you can import the existing log groups.
resource "aws_cloudwatch_log_group" "lambda" {
name = "/aws/lambda/${var.env}-${join("", split("_",title(var.lambda_name)))}-Lambda"
retention_in_days = 7
lifecycle {
create_before_destroy = true
prevent_destroy = false
}
}

Terraform: How to create a Kubernetes cluster on Google Cloud (GKE) with namespaces?

I'm after an example that would do the following:
Create a Kubernetes cluster on GKE via Terraform's google_container_cluster
... and continue creating namespaces in it, I suppose via kubernetes_namespace
The thing I'm not sure about is how to connect the newly created cluster and the namespace definition. For example, when adding google_container_node_pool, I can do something like cluster = "${google_container_cluster.hosting.name}" but I don't see anything similar for kubernetes_namespace.
In theory it is possible to reference resources from the GCP provider in K8S (or any other) provider in the same way you'd reference resources or data sources within the context of a single provider.
provider "google" {
region = "us-west1"
}
data "google_compute_zones" "available" {}
resource "google_container_cluster" "primary" {
name = "the-only-marcellus-wallace"
zone = "${data.google_compute_zones.available.names[0]}"
initial_node_count = 3
additional_zones = [
"${data.google_compute_zones.available.names[1]}"
]
master_auth {
username = "mr.yoda"
password = "adoy.rm"
}
node_config {
oauth_scopes = [
"https://www.googleapis.com/auth/compute",
"https://www.googleapis.com/auth/devstorage.read_only",
"https://www.googleapis.com/auth/logging.write",
"https://www.googleapis.com/auth/monitoring"
]
}
}
provider "kubernetes" {
host = "https://${google_container_cluster.primary.endpoint}"
username = "${google_container_cluster.primary.master_auth.0.username}"
password = "${google_container_cluster.primary.master_auth.0.password}"
client_certificate = "${base64decode(google_container_cluster.primary.master_auth.0.client_certificate)}"
client_key = "${base64decode(google_container_cluster.primary.master_auth.0.client_key)}"
cluster_ca_certificate = "${base64decode(google_container_cluster.primary.master_auth.0.cluster_ca_certificate)}"
}
resource "kubernetes_namespace" "n" {
metadata {
name = "blablah"
}
}
However in practice it may not work as expected due to a known core bug breaking cross-provider dependencies, see https://github.com/hashicorp/terraform/issues/12393 and https://github.com/hashicorp/terraform/issues/4149 respectively.
The alternative solution would be:
Use 2-staged apply and target the GKE cluster first, then anything else that depends on it, i.e. terraform apply -target=google_container_cluster.primary and then terraform apply
Separate out GKE cluster config from K8S configs, give them completely isolated workflow and connect those via remote state.
/terraform-gke/main.tf
terraform {
backend "gcs" {
bucket = "tf-state-prod"
prefix = "terraform/state"
}
}
provider "google" {
region = "us-west1"
}
data "google_compute_zones" "available" {}
resource "google_container_cluster" "primary" {
name = "the-only-marcellus-wallace"
zone = "${data.google_compute_zones.available.names[0]}"
initial_node_count = 3
additional_zones = [
"${data.google_compute_zones.available.names[1]}"
]
master_auth {
username = "mr.yoda"
password = "adoy.rm"
}
node_config {
oauth_scopes = [
"https://www.googleapis.com/auth/compute",
"https://www.googleapis.com/auth/devstorage.read_only",
"https://www.googleapis.com/auth/logging.write",
"https://www.googleapis.com/auth/monitoring"
]
}
}
output "gke_host" {
value = "https://${google_container_cluster.primary.endpoint}"
}
output "gke_username" {
value = "${google_container_cluster.primary.master_auth.0.username}"
}
output "gke_password" {
value = "${google_container_cluster.primary.master_auth.0.password}"
}
output "gke_client_certificate" {
value = "${base64decode(google_container_cluster.primary.master_auth.0.client_certificate)}"
}
output "gke_client_key" {
value = "${base64decode(google_container_cluster.primary.master_auth.0.client_key)}"
}
output "gke_cluster_ca_certificate" {
value = "${base64decode(google_container_cluster.primary.master_auth.0.cluster_ca_certificate)}"
}
Here we're exposing all the necessary configuration via outputs and use backend to store the state, along with these outputs in a remote location, GCS in this case. This enables us to reference it in the config below.
/terraform-k8s/main.tf
data "terraform_remote_state" "foo" {
backend = "gcs"
config {
bucket = "tf-state-prod"
prefix = "terraform/state"
}
}
provider "kubernetes" {
host = "https://${data.terraform_remote_state.foo.gke_host}"
username = "${data.terraform_remote_state.foo.gke_username}"
password = "${data.terraform_remote_state.foo.gke_password}"
client_certificate = "${base64decode(data.terraform_remote_state.foo.gke_client_certificate)}"
client_key = "${base64decode(data.terraform_remote_state.foo.gke_client_key)}"
cluster_ca_certificate = "${base64decode(data.terraform_remote_state.foo.gke_cluster_ca_certificate)}"
}
resource "kubernetes_namespace" "n" {
metadata {
name = "blablah"
}
}
What may or may not be obvious here is that cluster has to be created/updated before creating/updating any K8S resources (if such update relies on updates of the cluster).
Taking the 2nd approach is generally advisable either way (even when/if the bug was not a factor and cross-provider references worked) as it reduces the blast radius and defines much clearer responsibility. It's (IMO) common for such deployment to have 1 person/team responsible for managing the cluster and a different one for managing K8S resources.
There may certainly be overlaps though - e.g. ops wanting to deploy logging & monitoring infrastructure on top of a fresh GKE cluster, so cross provider dependencies aim to satisfy such use cases. For that reason I'd recommend subscribing to the GH issues mentioned above.