CDK Python SageMaker endpoint subnet deletion blocked by ENI - aws-cloudformation

I am deploying with a public and a private subnet in two AZs (see code below). However, when I delete the stack I get the following error:
"The subnet 'subnet-0d50f818f269ce4f2' has dependencies and cannot be deleted. (Service: Ec2, Status Code: 400...
When I try to manually delete the subnet, it shows that a network interface is still attached.
Under EC2 > Network interfaces, I see that the stack has created a NAT gateway for each public subnet (two AZs), but an ENI for only one of the private subnets (with the description "SageMaker managed ENI for endpoint mymodelendpoint"). After I delete this ENI, the delete process finishes. If I want to delete the ENI manually before cdk destroy has tried and failed, I first have to detach it. However, if I run destroy first, it is already detached and I can delete it manually right away.
Why did it create an ENI for only one of the private subnets (I have two AZs)? Why do I have to delete it manually? Is there a way to get cdk to delete this as part of the destroy command?
Could it be that cdk destroy is somehow doing the detach/delete out of order, so it fails but leaves the ENI detached?
Thanks
# https://klichx.dev/2021/03/22/sagemaker-deployments-with-aws-cdk/
container_definition_prop = sagemaker.CfnModel.ContainerDefinitionProperty(
    image=f'763104351884.dkr.ecr.{REGION}.amazonaws.com/huggingface-pytorch-inference:1.9.0-transformers4.11.0-cpu-py38-ubuntu20.04',
    image_config=sagemaker.CfnModel.ImageConfigProperty(
        repository_access_mode="Platform"
    ),
    container_hostname="hf-model-container",
    model_data_url=f"s3://sagemaker-{REGION}-{ACCOUNT_ID}/finetune-distilbert-base-cased-2023-02-10-00-55-05/output/model.tar.gz",
)
vpc = ec2.Vpc(
    self,
    "VPC",
    max_azs=2,
    subnet_configuration=[
        ec2.SubnetConfiguration(
            name="public-model-subnet", subnet_type=ec2.SubnetType.PUBLIC
        ),
        ec2.SubnetConfiguration(
            name="private-model-subnet", subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS
        ),
    ],
    gateway_endpoints={
        "S3": ec2.GatewayVpcEndpointOptions(
            service=ec2.GatewayVpcEndpointAwsService.S3
        )
    },
)
print('vpc.private_subnets', vpc.private_subnets)
model_vpc_config = sagemaker.CfnModel.VpcConfigProperty(
    security_group_ids=[vpc.vpc_default_security_group],
    subnets=[s.subnet_id for s in vpc.private_subnets],
)
mymodel = sagemaker.CfnModel(
    self,
    "Inference",
    execution_role_arn=self.sm_execution_role.role_arn,
    model_name="triage-multi-" + str(int(time.time())),
    # containers=[container_definition_prop],
    primary_container=container_definition_prop,
    # inference_execution_config=sagemaker.CfnModel.InferenceExecutionConfigProperty(
    #     mode="Direct"
    # ),
    vpc_config=model_vpc_config,
)
model_endpoint_config = sagemaker.CfnEndpointConfig(
    self,
    "mymodel-endpoint-config",
    production_variants=[
        sagemaker.CfnEndpointConfig.ProductionVariantProperty(
            initial_instance_count=1,
            initial_variant_weight=1.0,
            instance_type="ml.m5.xlarge",
            model_name=mymodel.model_name,
            variant_name="production-medium",
        ),
    ],
)
model_endpoint_config.add_depends_on(mymodel)
self.model_endpoint = sagemaker.CfnEndpoint(
    self,
    "mymodel-endpoint",
    endpoint_config_name=model_endpoint_config.attr_endpoint_config_name,
)
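For reference, the manual cleanup described above could be scripted along these lines. This is only a rough boto3 sketch, assuming default credentials and that the leftover interface's description starts with "SageMaker managed ENI" (that filter is based on what the console shows, not on any documented contract):

import boto3

ec2_client = boto3.client("ec2")

# Find the leftover SageMaker-managed interface(s). The description filter is
# an assumption taken from the description visible in the EC2 console.
enis = ec2_client.describe_network_interfaces(
    Filters=[{"Name": "description", "Values": ["SageMaker managed ENI*"]}]
)["NetworkInterfaces"]

for eni in enis:
    eni_id = eni["NetworkInterfaceId"]
    attachment = eni.get("Attachment")
    if attachment and attachment.get("Status") != "detached":
        ec2_client.detach_network_interface(
            AttachmentId=attachment["AttachmentId"], Force=True
        )
        # Wait until the ENI is actually free before deleting it
        ec2_client.get_waiter("network_interface_available").wait(
            NetworkInterfaceIds=[eni_id]
        )
    ec2_client.delete_network_interface(NetworkInterfaceId=eni_id)

Once the ENI is gone, cdk destroy (or the retried stack deletion) can remove the subnet.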

Related

How to deploy the kinesis-video-producer Docker image from AWS's own ECR to Fargate using CDK in TypeScript?

I'm trying to stand up a proof of concept that ingests an RTSP video stream into Kinesis Video. The provided documentation has a docker image all set up that seems to have everything I need to do this, hosted by AWS on 546150905175.dkr.ecr.us-west-2.amazonaws.com. What I am having trouble with, though, is getting that deployment (via an Amplify Custom category, in TypeScript CDK) to work.
I've tried different variations on
import * as iam from "@aws-cdk/aws-iam";
import * as ecs from "@aws-cdk/aws-ecs";
import * as ec2 from "@aws-cdk/aws-ec2";

const kinesisUserAccessKey = new iam.AccessKey(this, 'KinesisStreamUserAccessKey', {
  user: kinesisStreamUser,
})

const servicePrincipal = new iam.ServicePrincipal('ecs-tasks.amazonaws.com');
const executionRole = new iam.Role(this, 'IngestVideoTaskDefExecutionRole', {
  assumedBy: servicePrincipal,
  managedPolicies: [
    iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonECSTaskExecutionRolePolicy'),
  ]
});

const taskDefinition = new ecs.FargateTaskDefinition(this, 'IngestVideoTaskDef', {
  cpu: 512,
  memoryLimitMiB: 1024,
  executionRole,
})

const image = ecs.ContainerImage.fromRegistry('546150905175.dkr.ecr.us-west-2.amazonaws.com/kinesis-video-producer-sdk-cpp-amazon-linux:latest');

taskDefinition.addContainer('IngestVideoContainer', {
  command: [
    'gst-launch-1.0',
    'rtspsrc',
    `location="${locationParam.secretValue.toString()}"`,
    'short-header=TRUE',
    '!',
    'rtph264depay',
    '!',
    'video/x-h264,',
    'format=avc,alignment=au',
    '!',
    'kvssink',
    `stream-name="${cfnStream.name}"`,
    'storage-size=512',
    `access-key="${kinesisUserAccessKey.accessKeyId}"`,
    `secret-key="${kinesisUserAccessKey.secretAccessKey.toString()}"`,
    `aws-region="${REGION}"`,
    // `aws-region="${cdk.Aws.REGION}"`,
  ],
  image,
  logging: new ecs.AwsLogDriver({
    streamPrefix: 'IngestVideoContainer',
  }),
})

const service = new ecs.FargateService(this, 'IngestVideoService', {
  cluster,
  taskDefinition,
  desiredCount: 1,
  securityGroups: [
    ec2.SecurityGroup.fromSecurityGroupId(this, 'DefaultSecurityGroup', SECURITY_GROUP_ID)
  ],
  vpcSubnets: {
    subnets: SUBNET_IDS.map(subnetId => ec2.Subnet.fromSubnetId(this, subnetId, subnetId)),
  }
})
But it seems like regardless of what I do, an amplify push just stays 'in progress' for about an hour until I go into the CloudFormation console and cancel the stack update. Digging my way into the ECS Console, I managed to find an actual error message:
Resourceinitializationerror: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post "https://api.ecr.us-west-2.amazonaws.com/": dial tcp 52.94.177.118:443: i/o timeout
It seems to be some kind of networking issue, but I'm not sure how to proceed. Any assistance you can provide would be wonderful. Cheers!
Figured it out. For those stuck with similar issues: you have to give it an execution role with AmazonECSTaskExecutionRolePolicy (which I have already edited into the code above), and set assignPublicIp: true in the service.
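As a rough sketch of that change (shown in CDK Python to match the main question above; in TypeScript the prop is assignPublicIp, and the cluster/task definition names here are just placeholders):

from aws_cdk import aws_ecs as ecs

# Without a NAT gateway or ECR VPC endpoints, the task needs a public IP
# to reach ECR and pull its image.
service = ecs.FargateService(
    self,
    "IngestVideoService",
    cluster=cluster,                  # placeholder: an existing ecs.Cluster
    task_definition=task_definition,  # placeholder: the Fargate task definition
    desired_count=1,
    assign_public_ip=True,
)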

mock outputs in Terragrunt dependency

I want to use Terragrunt to deploy this example: https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/main/examples/complete-kubernetes-addons/main.tf
So far, I was able to create the VPC/EKS resource without a problem, I separated each module into a different module directory, and everything worked as expected.
When I tried to do the same for the kubernetes-addons module, I faced an issue with the data source trying to call the cluster and failing, since the cluster hadn't been created at that point.
Here's my terragrunt.hcl which I'm trying to execute for this specific module:
...
terraform {
  source = "git::git@github.com:aws-ia/terraform-aws-eks-blueprints.git//modules/kubernetes-addons?ref=v4.6.1"
}

locals {
  # Extract needed variables for reuse
  cluster_version = "${include.envcommon.locals.cluster_version}"
  name            = "${include.envcommon.locals.name}"
}

dependency "eks" {
  config_path = "../eks"

  mock_outputs = {
    eks_cluster_endpoint = "https://000000000000.gr7.eu-west-3.eks.amazonaws.com"
    eks_oidc_provider    = "something"
    eks_cluster_id       = "something"
  }
}

inputs = {
  eks_cluster_id       = dependency.eks.outputs.cluster_id
  eks_cluster_endpoint = dependency.eks.outputs.eks_cluster_endpoint
  eks_oidc_provider    = dependency.eks.outputs.eks_oidc_provider
  eks_cluster_version  = local.cluster_version
  ...
}
The error that I'm getting here:
`
INFO[0035]
Error: error reading EKS Cluster (something): couldn't find resource
with data.aws_eks_cluster.eks_cluster,
on data.tf line 7, in data "aws_eks_cluster" "eks_cluster":
7: data "aws_eks_cluster" "eks_cluster" {
`
The kubernetes-addons module deploys addons into an existing Kubernetes cluster. If you don't have a cluster running (and apparently you don't, since you're mocking the cluster_id variable), you get this error because the aws_eks_cluster data source can't find the cluster.
You need to create the K8s cluster first, before you can start deploying the addons.

CannotPullContainerError: failed to extract layer

I'm trying to run a task in a Windows container in Fargate mode on AWS.
The container is a .NET console application (Full Framework 4.5).
This is the task definition generated programmatically by the SDK:
var taskResponse = await ecsClient.RegisterTaskDefinitionAsync(new Amazon.ECS.Model.RegisterTaskDefinitionRequest()
{
    RequiresCompatibilities = new List<string>() { "FARGATE" },
    TaskRoleArn = TASK_ROLE_ARN,
    ExecutionRoleArn = EXECUTION_ROLE_ARN,
    Cpu = CONTAINER_CPU.ToString(),
    Memory = CONTAINER_MEMORY.ToString(),
    NetworkMode = NetworkMode.Awsvpc,
    Family = "netfullframework45consoleapp-task-definition",
    EphemeralStorage = new EphemeralStorage() { SizeInGiB = EPHEMERAL_STORAGE_SIZE_GIB },
    ContainerDefinitions = new List<Amazon.ECS.Model.ContainerDefinition>()
    {
        new Amazon.ECS.Model.ContainerDefinition()
        {
            Name = "netfullframework45consoleapp-task-definition",
            Image = "XXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/netfullframework45consoleapp:latest",
            Cpu = CONTAINER_CPU,
            Memory = CONTAINER_MEMORY,
            Essential = true
            // I REMOVED THE LOG DEFINITION TO SIMPLIFY THE PROBLEM
            //,
            //LogConfiguration = new Amazon.ECS.Model.LogConfiguration()
            //{
            //    LogDriver = LogDriver.Awslogs,
            //    Options = new Dictionary<string, string>()
            //    {
            //        { "awslogs-create-group", "true" },
            //        { "awslogs-group", $"/ecs/{TASK_DEFINITION_NAME}" },
            //        { "awslogs-region", AWS_REGION },
            //        { "awslogs-stream-prefix", $"{TASK_DEFINITION_NAME}" }
            //    }
            //}
        }
    }
});
This is the policy attached to the task's execution role (AmazonECSTaskExecutionRolePolicy):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*"
        }
    ]
}
I get this error when launching the task:
CannotPullContainerError: ref pull has been retried 1 time(s): failed to extract layer sha256:fe48cee89971abac42eedb9110b61867659df00fc5b0b90dd91d6e19f704d935: link /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/212/fs/Files/ProgramData/Microsoft/Event Viewer/Views/ServerRoles/RemoteDesktop.Events.xml /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/212/fs/Files/Windows/Microsoft.NET/assembly/GAC_64/Microsoft.Windows.ServerManager.RDSPlugin/v4.0_10.0.0.0__31bf3856ad364e35/RemoteDesktop.Events.xml: no such file or directory: unknown
Some searching drove me here:
https://aws.amazon.com/it/premiumsupport/knowledge-center/ecs-pull-container-api-error-ecr/
Point 1 says that if I run the task in a private subnet (as I am doing) I need a NAT gateway, with a related route, to guarantee communication towards ECR, but note that in my infrastructure I have a VPC endpoint to ECR....
So the first question is: is a VPC endpoint sufficient to guarantee communication from the container to the container image registry (ECR)? Or do I necessarily need to implement what point 1 says (a NAT gateway and a route in the route table), or alternatively run the task in a public subnet?
Could the error be related to the missing communication towards ECR, or could it be a missing-policy problem?
Make sure your VPC endpoint is configured correctly. Note that
"Amazon ECS tasks hosted on Fargate using platform version 1.4.0 or later require both the com.amazonaws.region.ecr.dkr and com.amazonaws.region.ecr.api Amazon ECR VPC endpoints as well as the Amazon S3 gateway endpoint to take advantage of this feature."
See https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html for more information
In the first paragraph of the page I linked: "You don't need an internet gateway, a NAT device, or a virtual private gateway."
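As a sketch of what that looks like in CDK Python (matching the VPC construct style of the main question above; the endpoint IDs are arbitrary and vpc is assumed to be an ec2.Vpc):

from aws_cdk import aws_ec2 as ec2

# ECR API and Docker registry interface endpoints, plus the S3 gateway
# endpoint that image layers are actually pulled from.
vpc.add_interface_endpoint(
    "EcrApiEndpoint", service=ec2.InterfaceVpcEndpointAwsService.ECR
)
vpc.add_interface_endpoint(
    "EcrDockerEndpoint", service=ec2.InterfaceVpcEndpointAwsService.ECR_DOCKER
)
vpc.add_gateway_endpoint(
    "S3Endpoint", service=ec2.GatewayVpcEndpointAwsService.S3
)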

aws-ecs-patterns error: Cluster for this service needs Ec2 capacity. Call addXxxCapacity() on the cluster

Hoping someone can help me here. According to the AWS CDK documentation, if I declare my VPC then I shouldn't need to declare 'capacity', but when I run cdk synth I get the following error...
throw new Error(Validation failed with the following errors:\n ${errorList});
Error: Validation failed with the following errors:
[PrerenderInfrasctutureStack/preRenderApp/Service] Cluster for this service needs Ec2 capacity. Call addXxxCapacity() on the cluster.
here is my code...
(I hope Nathan Peck sees this)
const ec2 = require('@aws-cdk/aws-ec2');
const ecsPattern = require('@aws-cdk/aws-ecs-patterns');
const ecs = require('@aws-cdk/aws-ecs');

class PrerenderInfrasctutureStack extends cdk.Stack {
  /**
   *
   * @param {cdk.Construct} scope
   * @param {string} id
   * @param {cdk.StackProps=} props
   */
  constructor(scope, id, props) {
    super(scope, id, props);

    const myVPC = ec2.Vpc.fromLookup(this, 'publicVpc', {
      vpcId: 'vpc-xxx'
    });

    const preRenderApp = new ecsPattern.ApplicationLoadBalancedEc2Service(this, 'preRenderApp', {
      vpcId: myVPC,
      certificate: 'arn:aws:acm:ap-southeast-2:xxx:certificate/xxx', // because this is specified, the LB will automatically use HTTPS
      domainName: 'my-dev.com.au.',
      domainZone: 'my-dev.com.au',
      listenerPort: 443,
      publicLoadBalancer: true,
      memoryReservationMiB: 8,
      cpu: 4096,
      desiredCount: 1,
      taskImageOptions: {
        image: ecs.ContainerImage.fromRegistry('xxx.dkr.ecr.region.amazonaws.com/express-prerender-server'),
        containerPort: 3000
      },
    });
  }
}

module.exports = { PrerenderInfrasctutureStack }
This is because if you don't explicitly pass a cluster then it uses the default cluster that exists on your account. However the default cluster starts out with no EC2 capacity, since EC2 instances cost money when they run. You can use the empty default cluster with Fargate mode since Fargate does not require EC2 capacity, it just runs your container inside Fargate, but the default cluster won't work with EC2 mode until you add EC2 instances to the cluster.
The easy solution here is to switch to ApplicationLoadBalancedFargateService instead, because Fargate services run using Fargate capacity, so they don't require EC2 instances in the cluster. Alternatively you should define your own cluster using something like:
// Create an ECS cluster
const cluster = new ecs.Cluster(this, 'Cluster', {
  vpc,
});

// Add capacity to it
cluster.addCapacity('DefaultAutoScalingGroupCapacity', {
  instanceType: new ec2.InstanceType("t2.xlarge"),
  desiredCapacity: 3,
});
Then pass that cluster as a property when creating the ApplicationLoadBalancedEc2Service
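For illustration, passing the cluster looks roughly like this (sketched in CDK Python, which the main question above uses; the JavaScript props are the same names in camelCase, and the image/port values are copied from the question, with a placeholder memory reservation):

from aws_cdk import aws_ecs as ecs, aws_ecs_patterns as ecs_patterns

pre_render_app = ecs_patterns.ApplicationLoadBalancedEc2Service(
    self,
    "preRenderApp",
    cluster=cluster,  # the cluster that addCapacity() was called on
    desired_count=1,
    memory_reservation_mib=1024,  # placeholder value
    task_image_options=ecs_patterns.ApplicationLoadBalancedTaskImageOptions(
        image=ecs.ContainerImage.from_registry(
            "xxx.dkr.ecr.region.amazonaws.com/express-prerender-server"
        ),
        container_port=3000,
    ),
)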
Hope this helps!

cloudify custom workflow missing cloudify_agent runtime information

I want to develop my own workflow named "backup" in Cloudify with my own plugin, but when I ran that workflow, the error below occurred:
'backup' workflow execution failed: RuntimeError: Workflow failed: Task failed 'script_runner.tasks.run' -> Missing cloudify_agent runtime information. This most likely means that the Compute node never started successfully
I don't understand why; can anybody help me solve this problem?
Here is my main blueprint code and plugin code
My main blueprint
tosca_definitions_version: cloudify_dsl_1_2

imports:
  - plugins/backup.yaml
  - types/types.yaml

node_templates:
  mynode:
    type: cloudify.nodes.Compute
    properties:
      ip: "ip"
      agent_config:
        install_method: none
        user: "user"
        key: "key_uri"
  myapp:
    type: cloudify.nodes.ApplicationModule
    interfaces:
      test_platform_backup:
        backup:
          implementation: scripts/backup.sh
          inputs:
            port: 6969
        post_backup:
          implementation: scripts/post_backup.sh
    relationships:
      - type: cloudify.relationships.contained_in
        target: mynode
My plugin code:
from cloudify.decorators import workflow
from cloudify.workflows import ctx
from cloudify.workflows.tasks_graph import forkjoin

@workflow
def backup(operation, type_name, operation_kwargs, is_node_operation, **kwargs):
    graph = ctx.graph_mode()
    send_event_starting_tasks = {}
    send_event_done_tasks = {}

    for node in ctx.nodes:
        if type_name in node.type_hierarchy:
            for instance in node.instances:
                send_event_starting_tasks[instance.id] = instance.send_event('Starting to run operation')
                send_event_done_tasks[instance.id] = instance.send_event('Done running operation')

    for node in ctx.nodes:
        if type_name in node.type_hierarchy:
            for instance in node.instances:
                sequence = graph.sequence()
                if is_node_operation:
                    operation_task = instance.execute_operation(operation, kwargs=operation_kwargs)
                else:
                    forkjoin_tasks = []
                    for relationship in instance.relationships:
                        forkjoin_tasks.append(relationship.execute_source_operation(operation))
                        forkjoin_tasks.append(relationship.execute_target_operation(operation))
                    operation_task = forkjoin(*forkjoin_tasks)
                sequence.add(
                    send_event_starting_tasks[instance.id],
                    operation_task,
                    send_event_done_tasks[instance.id])

    for node in ctx.nodes:
        for instance in node.instances:
            for rel in instance.relationships:
                instance_starting_task = send_event_starting_tasks.get(instance.id)
                target_done_task = send_event_done_tasks.get(rel.target_id)
                if instance_starting_task and target_done_task:
                    graph.add_dependency(instance_starting_task, target_done_task)

    return graph.execute()
It seems that your VM did not start.
From your code I can't understand what you are trying to do.
You don't install an agent and you don't have a Fabric connection to the VM, yet you are trying to run operations on the VM.
You should either install an agent (e.g. remove "install_method: none"), or add a Fabric connection to the VM and run the operations with it.