Creating an alarm for a SageMaker endpoint in CloudFormation - aws-cloudformation

I am trying to create an alarm for a SageMaker endpoint using CloudFormation. My endpoint has two variants. My CloudFormation template looks similar to the one below:
MySagemakerAlarmCPUUtilization:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmName: MySagemakerAlarmCPUUtilization
    AlarmDescription: Monitor the CPU levels of the endpoint
    MetricName: CPUUtilization
    ComparisonOperator: GreaterThanThreshold
    Dimension:
      - Name: EndpointName
        Value: my-endpoint
      - Name: VariantName
        Value: variant1
    Namespace: AWS/SageMaker/Endpoints
    EvaluationPeriods: 1
    Period: 600
    Statistic: Average
    Threshold: 50
I am having an issue with the dimension part, though: I get an invalid property error there. Does anyone know the correct syntax for targeting a particular variant of an endpoint in CloudFormation?

I realised I just had a typo in this. The property name should be Dimensions (plural), so:
    Dimensions:
      - Name: EndpointName
        Value: my-endpoint
      - Name: VariantName
        Value: variant1
Apart from that, the code is right, in case anyone else wants to use it.
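For reference, here is a sketch of the whole resource with that one-word fix applied. Everything else is copied from the question. The Namespace is kept as posted, but the SageMaker documentation lists endpoint instance metrics such as CPUUtilization under /aws/sagemaker/Endpoints, so if the alarm stays in INSUFFICIENT_DATA it is worth comparing the namespace with what the CloudWatch console shows for your endpoint.

```yaml
MySagemakerAlarmCPUUtilization:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmName: MySagemakerAlarmCPUUtilization
    AlarmDescription: Monitor the CPU levels of the endpoint
    MetricName: CPUUtilization
    ComparisonOperator: GreaterThanThreshold
    Dimensions:                          # plural: the only change from the question
      - Name: EndpointName
        Value: my-endpoint
      - Name: VariantName
        Value: variant1
    Namespace: AWS/SageMaker/Endpoints   # as posted; verify against the console
    EvaluationPeriods: 1
    Period: 600
    Statistic: Average
    Threshold: 50
```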

Related

Query CloudWatch metrics on the basis of dimensions using CloudFormation

I have installed the CloudWatch agent on my EC2 Linux machine and receive the disk_used_percent metric for each partition. I want to create a CloudWatch alarm on only one partition. The console shows the following columns for each metric:
Instance name, InstanceId, ImageId, device, fstype, path, Metric name
Now I want to create a CloudWatch alarm where:
Namespace: CWAgent
Metric name: disk_used_percent
InstanceId: X
device: xvda1
I'm using the following CloudFormation code:
CloudWatchAlarm:
  Type: "AWS::CloudWatch::Alarm"
  Properties:
    AlarmName: "disk-space-threshold"
    AlarmDescription: "A Cloudwatch Alarm that triggers when disk space of EBS is less than 50%"
    MetricName: "disk_used_percent"
    Namespace: "CWAgent"
    Statistic: "Average"
    Period: "60"
    EvaluationPeriods: "1"
    Threshold: "75"
    ComparisonOperator: "GreaterThanOrEqualToThreshold"
    TreatMissingData: "missing"
    Dimensions:
      - Name: InstanceId
        Value: !Ref InstanceID
      - Name: ImageId
        Value: !Ref ImageID
      - Name: device
        Value: !Ref Device
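When the alarm is created, it shows insufficient data. What could be the issue?
You can't identify the metric with only 3 of its dimensions. CloudWatch treats each unique combination of dimensions as a separate metric, so the alarm must always specify the full set of dimensions the metric was published with (here that also includes fstype and path). With a partial set, the alarm points at a metric that has no data.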
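A minimal sketch of the Dimensions block with the full set, based on the dimensions listed in the question. The FsType and Path parameters are hypothetical (they are not in the original template), and every value must match exactly what the CloudWatch console shows for the metric.

```yaml
    Dimensions:
      - Name: InstanceId
        Value: !Ref InstanceID
      - Name: ImageId
        Value: !Ref ImageID
      - Name: device
        Value: !Ref Device        # e.g. xvda1
      - Name: fstype
        Value: !Ref FsType        # hypothetical parameter, e.g. xfs or ext4
      - Name: path
        Value: !Ref Path          # hypothetical parameter, e.g. /
```

If the agent publishes further dimensions for the metric, they would need to be listed here as well.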

Envoy proxy is using too much memory

Envoy is using all the memory and the pods are getting evicted. Is there a way to set a limit on how much memory the Envoy proxy can use in the Envoy configuration file?
You can probably do that by configuring the overload manager in Envoy's bootstrap configuration. See the Envoy overload manager documentation for more details. It is done by adding an overload_manager section as follows:
overload_manager:
  refresh_interval: 0.25s
  resource_monitors:
  - name: "envoy.resource_monitors.fixed_heap"
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
      # TODO: Tune for your system.
      max_heap_size_bytes: 2147483648 # 2 GiB <==== fix this!
  actions:
  - name: "envoy.overload_actions.shrink_heap"
    triggers:
    - name: "envoy.resource_monitors.fixed_heap"
      threshold:
        value: 0.95
  - name: "envoy.overload_actions.stop_accepting_requests"
    triggers:
    - name: "envoy.resource_monitors.fixed_heap"
      threshold:
        value: 0.98

Setting up an AWS CloudWatch alert when ElasticsearchRequests are too high

I am trying to use CloudFormation to set up a CloudWatch alarm that fires if more than, say, 5000 HTTP requests are sent to an AWS ES cluster. I see there is an ElasticsearchRequests metric I can use, and this is what I have so far:
ClusterElasticsearchRequestsTooHighAlarm:
  Condition: HasAlertTopic
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmActions:
    - {'Fn::ImportValue': !Sub '${ParentAlertStack}-TopicARN'}
    AlarmDescription: 'ElasticsearchRequests are too high.'
    ComparisonOperator: GreaterThanThreshold
    Dimensions:
    - Name: ClientId
      Value: !Ref 'AWS::AccountId'
    - Name: DomainName
      Value: !Ref ElasticsearchDomain
    EvaluationPeriods: 1
    MetricName: 'ElasticsearchRequests'
    Namespace: 'AWS/ES'
    OKActions:
    - {'Fn::ImportValue': !Sub '${ParentAlertStack}-TopicARN'}
    Period: 60
    Statistic: Maximum
    Threshold: 5000
Does this look correct?
Should I use SampleCount instead of Maximum for the Statistic?
Any advice is much appreciated
According to the AWS documentation on monitoring Elasticsearch/OpenSearch clusters, the relevant statistic for the ElasticsearchRequests metric is Sum.
Here is what the docs say:
OpenSearchRequests
The number of requests made to the Elasticsearch/OpenSearch cluster.
Relevant statistics: Sum
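Applied to the template in the question, only the Statistic line changes. This is a sketch that keeps the question's metric name, period, and threshold. With Sum, the alarm compares the total number of requests in each 60-second period against 5000. Maximum would compare only the largest single datapoint, and SampleCount would count datapoints rather than requests.

```yaml
    EvaluationPeriods: 1
    MetricName: 'ElasticsearchRequests'
    Namespace: 'AWS/ES'
    Period: 60          # length of each evaluation window, in seconds
    Statistic: Sum      # total requests in the window (was Maximum)
    Threshold: 5000
```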

GKE cluster deployment with custom network

I am trying to create a YAML file to deploy a GKE cluster in a custom network I created. I get this error:
JSON payload received. Unknown name \"network\": Cannot find field."
I have tried a few names for the resources, but I am still seeing the same issue.
resources:
- name: myclus
  type: container.v1.cluster
  properties:
    network: projects/project-251012/global/networks/dev-cloud
    zone: "us-east4-a"
    cluster:
      initialClusterVersion: "1.12.9-gke.13"
      currentMasterVersion: "1.12.9-gke.13"
      ## Initial NodePool config.
      nodePools:
      - name: "myclus-pool1"
        initialNodeCount: 3
        version: "1.12.9-gke.13"
        config:
          machineType: "n1-standard-1"
          oauthScopes:
          - https://www.googleapis.com/auth/logging.write
          - https://www.googleapis.com/auth/monitoring
          - https://www.googleapis.com/auth/ndev.clouddns.readwrite
          preemptible: true
## Duplicates node pool config from v1.cluster section, to get it explicitly managed.
- name: myclus-pool1
  type: container.v1.nodePool
  properties:
    zone: us-east4-a
    clusterId: $(ref.myclus.name)
    nodePool:
      name: "myclus-pool1"
I expect it to place the cluster nodes in this network.
The network field needs to be part of the cluster spec. The top level of properties should contain just zone and cluster; network should be at the same indentation as initialClusterVersion. See the container.v1.cluster API reference page for more.
Your manifest should look more like the following.
EDIT: there is some confusion in the API reference docs concerning deprecated fields. I originally offered a YAML that applies to the newer API, not the one you are using. I've updated the answer with the correct syntax for the basic v1 API, and further down I've added the newer API (which currently relies on gcp-types to deploy).
resources:
- name: myclus
  type: container.v1.cluster
  properties:
    projectId: [project]
    zone: us-central1-f
    cluster:
      name: my-clus
      zone: us-central1-f
      network: [network_name]
      subnetwork: [subnet] ### leave this field blank if using the default network
      initialClusterVersion: "1.13"
      nodePools:
      - name: my-clus-pool1
        initialNodeCount: 0
        config:
          imageType: cos
- name: my-pool-1
  type: container.v1.nodePool
  properties:
    projectId: [project]
    zone: us-central1-f
    clusterId: $(ref.myclus.name)
    nodePool:
      name: my-clus-pool2
      initialNodeCount: 0
      version: "1.13"
      config:
        imageType: ubuntu
The newer API (which provides more functionality and allows you to use more features including the v1beta1 API and beta features) would look something like this:
resources:
- name: myclus
  type: gcp-types/container-v1:projects.locations.clusters
  properties:
    parent: projects/shared-vpc-231717/locations/us-central1-f
    cluster:
      name: my-clus
      zone: us-central1-f
      network: shared-vpc
      subnetwork: local-only ### leave this field blank if using the default network
      initialClusterVersion: "1.13"
      nodePools:
      - name: my-clus-pool1
        initialNodeCount: 0
        config:
          imageType: cos
- name: my-pool-2
  type: gcp-types/container-v1:projects.locations.clusters.nodePools
  properties:
    parent: projects/shared-vpc-231717/locations/us-central1-f/clusters/$(ref.myclus.name)
    nodePool:
      name: my-clus-separate-pool
      initialNodeCount: 0
      version: "1.13"
      config:
        imageType: ubuntu
Another note: you may want to modify your scopes. The current scopes will not allow nodes to pull images from gcr.io, so some system pods may not spin up properly, and if you are using Google's repository, you will be unable to pull those images.
Finally, you don't want to define the same node pool both in the cluster spec and as a separate resource. Instead, create the cluster with a basic (default) node pool, and create all additional node pools as separate resources so you can manage them without going through the cluster. There are very few updates you can perform on a node pool, aside from resizing.
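As a sketch of the scope change, the node config from the question could add the read-only Cloud Storage scope, which is what lets nodes pull images from gcr.io. The other values are copied from the question; whether you also need further scopes depends on your workloads.

```yaml
        config:
          machineType: "n1-standard-1"
          oauthScopes:
          - https://www.googleapis.com/auth/devstorage.read_only  # added: allows pulling images from gcr.io
          - https://www.googleapis.com/auth/logging.write
          - https://www.googleapis.com/auth/monitoring
          - https://www.googleapis.com/auth/ndev.clouddns.readwrite
          preemptible: true
```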

Grok exporter shows changes only after restart

We have configured Grok exporter to monitor errors from various system logs, but it seems changes are reflected only once we restart the respective grok instance.
Please see the config.yml below:
global:
  config_version: 2
input:
  type: file
  path: /ZAMBAS/logs/Healthcheck/EFT/eftcl.log
  readall: true
  poll_interval_seconds: 5
grok:
  patterns_dir: ./patterns
metrics:
  - type: gauge
    name: EFTFileTransfers
    help: Counter metric example with labels.
    match: '%{WORD:Status}\s%{GREEDYDATA:FileTransferTime};\s\\%{WORD:Customer}\\%{WORD:OutboundSystem}\\%{GREEDYDATA:File};\s%{WORD:Operation};\s%{NUMBER:Code}'
    value: '{{.Code}}'
    cumulative: false
    labels:
      Customer: '{{.Customer}}'
      OutboundSystem: '{{.OutboundSystem}}'
      File: '{{.File}}'
      Status: '{{.Status}}'
      Operation: '{{.Operation}}'
      FileTransferTime: '{{.FileTransferTime}}'
  - type: gauge
    name: EFTFileSuccessfullTransfers
    help: Counter metric example with labels.
    match: 'Success\s%{GREEDYDATA:Time};\s\\%{WORD:Customer}\\%{WORD:OutboundSystem}\\%{GREEDYDATA:File};\s%{WORD:Operation};\s%{NUMBER:Code}'
    value: '{{.Code}}'
    cumulative: false
  - type: gauge
    name: EFTFileFailedTransfers
    help: Counter metric example with labels.
    match: 'Failed\s%{GREEDYDATA:Time};\s\\%{WORD:Customer}\\%{WORD:OutboundSystem}\\%{GREEDYDATA:File};\s%{WORD:Operation};\s%{NUMBER:Code}'
    value: '{{.Code}}'
    cumulative: false
server:
  port: 9845
Without a restart, it doesn't reflect the correct pattern matches. Once I restart the grok instance, it reflects them perfectly.
Is there some parameter I am missing here?
Thanks
Priyotosh
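Simply changing readall to false in the input section will stop grok_exporter from processing the same lines again each time it is restarted. With readall: true, the whole file is read from the beginning at startup; with readall: false, only lines written after startup are processed. Please see the grok_exporter docs on GitHub.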
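A sketch of the input section with that change applied. The path and poll interval are copied from the question, and only the readall value differs.

```yaml
input:
  type: file
  path: /ZAMBAS/logs/Healthcheck/EFT/eftcl.log
  readall: false            # only process lines written after grok_exporter starts
  poll_interval_seconds: 5
```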