Argo Workflows: use podSpecPatch to dynamically increase memory resource requests on retries

I'm looking for a way to dynamically increase memory resource requests in Argo Workflow retries, based on what is described here.
However, in my case, the original memory request parameters need to be taken from input parameters.
I have input parameters memreqnum (e.g. 250) and memrequnit (e.g. Mi), and I'm trying to define a podSpecPatch
that will enable me to specify the memory request as a multiple of memreqnum (based on retries) with the original
memrequnit.
I have made all kinds of attempts, such as the following, but none seem to work:
podSpecPatch: |
  containers:
    - name: main
      resources:
        requests:
          memory: "{{={{=(asInt{{retries}} + 1)}} * {{inputs.parameters.memreqnum}}}}{{inputs.parameters.memrequnit}}"
The above podSpecPatch gives me the following error:
invalid podSpecPatch "{\"containers\":[{\"name\":\"main\",\"resources\":{\"requests\":{\"memory\":\"{{={{=(asInt{{retries}} + 1)}} * 250}}Mi\"}}}]}": quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$'
Can someone please point me in the right direction?
Many thanks!
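Update: my current best guess, still unverified, is that {{...}} tags cannot be nested inside a {{=...}} expression tag; inside an expression, retries and the input parameters should be referenced as plain variables, e.g. converted with asInt. Something like the following sketch (assuming asInt is available in the expression environment and that an expression tag can be followed by a simple {{...}} tag in the same field):
podSpecPatch: |
  containers:
    - name: main
      resources:
        requests:
          memory: "{{= (asInt(retries) + 1) * asInt(inputs.parameters.memreqnum) }}{{inputs.parameters.memrequnit}}"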

Related

PATCH endpoint naming and realisation

We have a REST resource
/tasks/{task-type}
and only GET methods are available:
GET /tasks/{task-type}
GET /tasks/{task-type}/{id}
The Task entity contains meta info like created, finished, status, ref key, and try counts for scheduled tasks.
Now we are facing a problem: a task may contain incorrect data, and then its execution always fails.
Because the scheduler invokes tasks every 5 minutes, there are a lot of errors in the logs, and the largest try counts are around 500k. The solution I found is to limit try_count to five (for example). Now we need a way to manually reset the try count to zero. I found two solutions:
1.
PATCH /tasks/{task-type}/{id}/discard-try-count - no response body
This solution looks pretty simple, but it violates the REST convention because we use an action (verb) in the naming. And if we need to change other fields, we will end up with a lot of endpoints in this style.
2a.
PATCH /tasks/{task-type}/{id}
body:
{
  "tryCounts": int
}
This looks like what REST wants to see, and we can easily add new fields to modify, but now the client can set any value for tryCounts.
2b.
PATCH /tasks/{task-type}/{id}
body:
{
  "tryCounts": int // validate that the try count can only be set to zero
}
It differs from the previous one by the presence of validation. This looks like the most reliable solution. Is it really the best fit?
The non-verb convention is not a standard; you can violate it if you want to. It can also be worked around with very simple tricks: just convert the verb into a noun and you will be OK, something like:
POST /tasks/{task-type}/{id}/try-count-discarding
Another way is setting the try count to zero:
PUT /tasks/{task-type}/{id}/try-count 0
Yet another solution is combining the two, which I like the most:
PATCH /tasks/{task-type}/{id}/try-count {"op": "reset"}
Or another variant:
PATCH /tasks/{task-type}/{id} {"op": "discard-try-count"}
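For illustration, the validated 2b variant might look like this on the wire (the task type, id, and error payload below are made up):
PATCH /tasks/email-notification/42 HTTP/1.1
Content-Type: application/json

{"tryCounts": 0}

and any other value would be rejected, for example:
HTTP/1.1 422 Unprocessable Entity

{"error": "tryCounts can only be set to 0"}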

How to join 2 sets of Prometheus metrics?

AKS = 1.17.9
Prometheus = 2.16.0
kube-state-metrics = 1.8.0
My use case: I want to alert when one of my persistent volumes is not in the "Bound" phase, and only when it falls within a predefined set of namespaces.
This got me to my first attempt at joining Prometheus metrics - so, please bear with me : )
I opted to use the following to obtain the pv phase:
kube_persistentvolume_status_phase{phase="Bound",job="kube-state-metrics"}
Renders:
kube_persistentvolume_status_phase{instance="10.147.5.110:8080",job="kube-state-metrics",persistentvolume="pvc-33197ae6-d42a-777e-b8ca-efbd66a8750d",phase="Bound"} 1
kube_persistentvolume_status_phase{instance="10.147.5.110:8080",job="kube-state-metrics",persistentvolume="pvc-165d5006-erd4-481e-8acc-eed4a04a3bce",phase="Bound"} 1
This worked well, except for the fact that it does not include the namespace.
So I managed to determine the persistentvolumeclaim namespaces with this:
kube_persistentvolumeclaim_info{namespace=~"monitoring|vault"}
Renders:
kube_persistentvolumeclaim_info{instance="10.147.5.110:8080",job="kube-state-metrics",namespace="vault",persistentvolumeclaim="vault-file",storageclass="default",volumename="pvc-33197ae6-d42a-777e-b8ca-efbd66a8750d"} 1
kube_persistentvolumeclaim_info{instance="10.147.5.110:8080",job="kube-state-metrics",namespace="monitoring",persistentvolumeclaim="prometheus-prometheus-db-prometheus-prometheus-0",storageclass="default",volumename="pvc-165d5006-erd4-481e-8acc-eed4a04a3bce"} 1
So my idea was to join these sets on the matching values of the following fields:
persistentvolume (in kube_persistentvolume_status_phase)
on
volumename (in kube_persistentvolumeclaim_info)
BUT, if I understood it correctly, you can only join two metric sets on labels whose names and values match exactly. I hence opted for the "instance" and "job" labels, as these were common to both sides and matching.
kube_persistentvolume_status_phase{phase!="Bound",job="kube-state-metrics"}  * on(instance,job) group_left(namespace) kube_persistentvolumeclaim_info{namespace=~"monitoring|vault"}
Renders:
Error executing query: found duplicate series for the match group {instance="10.147.5.110:8080" , job="kube-state-metrics"} on the right hand-side of the operation: [{__name__="kube_persistentvolumeclaim_info", instance="10.147.5.110:8080", job="kube-state-metrics", namespace="monitoring", persistentvolumeclaim="alertmanager-prometheusam-db-alertmanager-prometheusam-0", storageclass="default", volumename="pvc-b8406fb8-3262-7777-8da8-151815e05d75"}, {__name__="kube_persistentvolumeclaim_info", instance="10.147.5.110:8080", job="kube-state-metrics", namespace="vault", persistentvolumeclaim="vault-file", storageclass="default", volumename="pvc-33197ae6-d42a-777e-b8ca-efbd66a8750d"}];many-to-many matching not allowed: matching labels must be unique on one side
So, in all fairness, the query communicates the problem well. I attempted to solve it with the "ignoring" option, trying to keep only the matching labels and values (instance and job) and to exclude/ignore the non-matching ones on both sides. This did not work either, resulting in a parsing error, which in turn nudged me to take a step back and reassess what I am doing.
I am just a bit concerned that I am perhaps barking up the wrong tree here.
My question is: is this at all possible, and if so, how? Or is there perhaps another, more prudent way to achieve this?
Thanks in advance!
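Update: one direction that might work (a sketch, untested against my cluster): since Prometheus can only match on label names present on both sides, label_replace can first copy the claim's volumename value into a persistentvolume label, so the join key becomes the volume name itself instead of instance/job:
kube_persistentvolume_status_phase{phase!="Bound",job="kube-state-metrics"}
  * on(persistentvolume) group_left(namespace)
label_replace(
  kube_persistentvolumeclaim_info{namespace=~"monitoring|vault"},
  "persistentvolume", "$1", "volumename", "(.+)"
)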

Narrow down a whole metric to an instance in grafana

If I have a prometheus metric as
((node_memory_MemTotal - node_memory_MemFree - node_memory_Buffers - node_memory_Cached) / node_memory_MemTotal) * 100
How can I apply that only to the current $instance? I have tried surrounding it in brackets and adding:
{instance="$instance"}
(for which I have declared a variable), but it doesn't like it. Surely I don't have to repeat it after every metric name?
Surely I don't have to repeat it after every metric name?
Yes, for the best performance you should use the full selector with each metric:
((node_memory_MemTotal{instance="$instance"}
- node_memory_MemFree{instance="$instance"}
- node_memory_Buffers{instance="$instance"}
- node_memory_Cached{instance="$instance"}
) / node_memory_MemTotal{instance="$instance"}
) * 100
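One caveat: if the $instance variable is set to multi-value in Grafana, the equality matcher will not work, because the variable renders as a regex alternation; the regex matcher is needed instead, e.g.:
node_memory_MemTotal{instance=~"$instance"}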

Azure APIM Policy Editor

I would very much like to be able to set Azure APIM policy attributes based on a user's JWT claims data. I have been able to set string values for things like counter-key and increment-condition, but I can't set all attributes. I imagined doing something like the following:
<rate-limit-by-key
    calls="@((int) context.Variables["IdentityToken"].AsJwt().Claims.GetValueOrDefault("/LimitRate/Limit", "5"))"
    renewal-period="@((int) context.Variables["IdentityToken"].AsJwt().Claims.GetValueOrDefault("/LimitRate/Duration/InSeconds", "60"))"
    counter-key="@((string)context.Variables["Subject"])"
    increment-condition="@(context.Response.StatusCode == 200)"
/>
However there seems to be some validation happening when I save the policy as I get the following error:
Error in element 'rate-limit-by-key' on line 98, column 10: The 'calls' attribute is invalid - The value '@((int) context.Variables["IdentityToken"].AsJwt().Claims.GetValueOrDefault("/LimitRate/Limit", "5"))' is invalid according to its datatype 'http://www.w3.org/2001/XMLSchema:int' - The string '@((int) context.Variables["IdentityToken"].AsJwt().Claims.GetValueOrDefault("/LimitRate/Limit", "5"))' is not a valid Int32 value.
I even have trouble setting a string parameter (albeit one with a strict format):
<quota-by-key
    calls="10"
    bandwidth="100"
    renewal-period="@((string) context.Variables["IdentityToken"].AsJwt().Claims.GetValueOrDefault("/Quota/RenewalPeriod", "P00Y00M01DT00H00M00S"))"
    counter-key="@((string)context.Variables["Subject"])"
/>
This gives the following when I try to save the policy:
Error in element 'quota-by-key' on line 99, column 6: @((string) context.Variables["IdentityToken"].AsJwt().Claims.GetValueOrDefault("/Quota/RenewalPeriod", "P00Y00M01DT00H00M00S")) is not in a valid format. Provide number of seconds or use 'PxYxMxDTxHxMxS' format where 'x' is a number.
I have tried a large set of variations: casting, Convert.ToInt32, claims that are not strings, @{return 5}, @(5), etc., but there seems to be some validation happening at save time that is stopping it.
Is there a way around this issue? I think it would be a useful feature to add to my API.
The calls attribute on rate-limit-by-key and quota-by-key does not support policy expressions. Internal limitations unfortunately block us from treating it on a per-request basis. The best you can do is categorize requests into a few finite groups and apply the rate limit/quota conditionally using a choose policy.
Alternatively, try using the increment-count attribute to control by how much the counter is increased on each request.
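For example, the conditional workaround might look something like the sketch below (the tier variable and the concrete limits are made up; the variable would have to be set earlier, e.g. by a set-variable policy reading a claim):
<choose>
    <when condition="@(context.Variables.GetValueOrDefault<string>("tier", "basic") == "premium")">
        <rate-limit-by-key calls="100" renewal-period="60" counter-key="@((string)context.Variables["Subject"])" />
    </when>
    <otherwise>
        <rate-limit-by-key calls="5" renewal-period="60" counter-key="@((string)context.Variables["Subject"])" />
    </otherwise>
</choose>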

What is the best way to ensure the correctness of data returned by a SNMP query?

I am working on code which uses the snmp->get_bulk_request() method to make SNMP queries to get interface table details from a network device.
The problem I am facing is that sometimes the data I receive from the query is missing some details. This is a transient issue.
I believe that adding a set number of retries will reduce the probability of error. But as I go through the documentation for get_bulk_request(), I find a parameter called
maxrepetitions, and it is not clear to me from the documentation what this parameter does.
I am trying to figure out what effect the maxrepetitions parameter has when used with the get_bulk_request() method. I have gone through the documentation in "get_bulk_request() - send a SNMP get-bulk-request to the remote agent" and found this:
$result = $session->get_bulk_request(
    [-callback        => sub {},]      # non-blocking
    [-delay           => $seconds,]    # non-blocking
    [-contextengineid => $engine_id,]  # v3
    [-contextname     => $name,]       # v3
    [-nonrepeaters    => $non_reps,]
    [-maxrepetitions  => $max_reps,]
    -varbindlist      => \@oids,
);
The default value for get-bulk-request -maxrepetitions is 0. The maxrepetitions value specifies the number of successors to be returned for the remaining variables in the variable-bindings list.
Specifically, my questions are:
Is adding maxrepetitions equivalent to adding retries for the query?
Is retrying the right way to ensure the data is most probably correct?
If not, what is the best method to keep the probability of error low in the data returned by an SNMP query?
From the man page:
Set the max-repetitions field in the GETBULK PDU. This specifies the maximum number of iterations over the repeating variables.
Example
snmpbulkget -v2c -Cn1 -Cr5 -Os -c public zeus system ifTable
will retrieve the variable system.sysDescr.0 (which is the lexicographically next object to system) and the first 5 objects in the ifTable:
sysDescr.0 = STRING: "SunOS zeus.net.cmu.edu 4.1.3_U1 1 sun4m"
ifIndex.1 = INTEGER: 1
ifIndex.2 = INTEGER: 2
ifDescr.1 = STRING: "lo0"
et cetera.
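To make the distinction concrete, here is a minimal Net::SNMP sketch (the hostname, community string, and numbers are made up): -maxrepetitions controls how many successors a single GETBULK PDU returns per repeating varbind, while retransmission on failure is governed separately, e.g. by the session's -retries option or your own loop.
use strict;
use warnings;
use Net::SNMP;

# Open a v2c session; -retries is transport-level retransmission and is
# unrelated to -maxrepetitions below.
my ($session, $error) = Net::SNMP->session(
    -hostname  => 'router1.example.com',   # made up
    -community => 'public',
    -version   => 'snmpv2c',
    -retries   => 3,
);
die "session error: $error" unless defined $session;

# One GETBULK requesting up to 10 successors of the ifTable OID.
my $result = $session->get_bulk_request(
    -maxrepetitions => 10,
    -varbindlist    => ['1.3.6.1.2.1.2.2'],   # ifTable
);
die "request error: " . $session->error() unless defined $result;

printf "%s => %s\n", $_, $result->{$_} for sort keys %{$result};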