Google Video Intelligence API : inconsistent offset delta for Text Detection (and other features)

Google Video Intelligence API : inconsistent offset delta for Text Detection (and other features) - offset

In text_annotations results, most bounding boxes have an offset that is a multiple of 0.12. Lately we have found some offsets are now randomly multiples of 0.10 and, even worse, sometimes 0.01.
Exemple:
time_offset {
seconds: 10
nanos: 110100000
}
We also have tricky situations where rounding the value produces an outlying offset, here for example giving us 9.81 instead of the 9.80 we would need to stay consistent with the 0.10 deltas:
time_offset {
seconds: 9
nanos: 809800000
}
Could someone clarify the rules for the offsets as we are debugging blind trying to take into account all these cases.
What time deltas do we need to cater for and what features use what deltas ?

Related

How to persist previous data point when time range doesn't include a data point

TL;DR:
Can I get Grafana to show me the previous data point, when the currently selected time period does not have a data point? I have an example which sounds ridiculous, but at least it's simple to understand: I send data every 1 minute, and I wish to zoom into the last 30 seconds, and still see data. You may ask "why not just zoom out to 2 minutes" but the reason is that other data is on the same graph that has updated more often, and I wish to compare with that data. Also, for the more lengthy reasons below.
If not, how can I achieve what I want to achieve, see below?
Context
For a few years, I have been monitoring the water level in three of our basement sumps (which have pumps installed) by sending this data from Node-RED to InfluxDB, then visualising the sump levels in Grafana. I have set up three waterproof ultrasonic distance sensors, each pointed down a pipe that is inserted vertically into each sump. The water fills the pipe and the distance sensor, connected to an Arduino, sends me the reading. The Arduino also has other sensors connected (temp / humidity) and deals with distance calibrations to calculate the percent full of each sump. All this data is sent to Node-RED. In total, I am sending 4 values per sump: distance measurement in mm, percent full, temp, humidity. So that's 12 fields. Data is sent every 2 seconds, because I wished to have a reasonably high resolution to see nice curves in graphs.
Also I decided to store all this data so that I could later troubleshoot issues (we have had sewage floods resulting in water not being able to be pumped away, etc...) and design some warning systems for these issues based on data.
Storing 12 values for every 2 seconds, over the course of a number of years, takes up a lot of space (8GB).
Nature of the data
Storing this resolution of data has also helped me be able to describe the nature of the data. I will do so here.
(1) Non-meaningful NOISE (see below) - the percent-full reading goes up and down by 1 or 2 percent every couple of seconds:
(2) Meaningful DRIFT (see below) - I don't mean sensor drift, I am referring to actual water levels changing slowly over time, e.g. over 1 day or 1 week. Perhaps condensation on the walls drips down into the sump, or water evaporates from the sump, and the value can waver by a few percent over the course of a day. Each sump has slightly different characteristics.
(3) Meaningful MONITORING DATA - during wet weather, depending on rainfall amount, the sumps fill up over the course of say 30 mins to 3 hours. Then the pumps run and the water level drops again, wavers a bit, then the sumps continue to fill up. If the rain stopped, you can see a lovely curve as the water fills in progressively more slowly (see the green line below):
Solution to downsample
I know Influx has its own downsampling possibilities, however because of the nature of the data (which can hardly vary for 2 months but when it does, I really need to capture it in detail), I don't think lowering the sample rate is a great idea.
I have some understanding of digital filters (e.g. low pass etc) but have never programmed one myself. So I have written a basic filter in javascript (a Node-RED function) to filter the data in realtime as follows: only send each reading when it has changed from the previous one by x amount. (And update the previous one, when that occurs.)
This has already vastly reduced the amount of data being stored, and I can vary x to filter out noise shown in my first graph above, at the expense of resolution when the pumps run. Even if I set the x value to 2, it still vastly reduces data over long periods of dry weather.
So - onto my problem! Now data is not being logged to InfluxDB unless there is some meaningful change. Which means that when I zoom in to e.g. 15 minute timeframe of data, there is nothing to see.
Grafana does have the option of "fill (previous)" but this draws a line between points on the existing graph, rather than showing the previous data as if it hasn't changed since that point. Now my grafana dashboard looks a bit sad :(
One proposed solution is, in addition to sending "delta" data, send "summary" data, that is - send a full suite of data every 1 minute regardless of whether data changed or not. But then we get noise back again, and pointless storage.
Any other ideas?

How can a Neural Network learn from testing outputs against external conditions which it can not directly control

In order to simplify the question and hopefully the answer I will provide a somewhat simplified version of what I am trying to do.
Setting up fixed conditions:
Max Oxygen volume permitted in room = 100,000 units
Target Oxygen volume to maintain in room = 100,000 units
Maximum Air processing cycles per sec == 3.0 cycles per second (min is 0.3)
Energy (watts) used per second is this formula : (100w * cycles_per_second)SQUARED
Maximum Oxygen Added to Air per "cycle" = 100 units (minimum 0 units)
1 person consumes 10 units of O2 per second
Max occupancy of room is 100 person (1 person is min)
inputs are processed every cycle and outputs can be changed each cycle - however if an output is fed back in as an input it could only affect the next cycle.
Lets say I have these inputs:
A. current oxygen in room (range: 0 to 1000 units for simplicity - could be normalized)
B. current occupancy in room (0 to 100 people at max capacity) OR/AND could be changed to total O2 used by all people in room per second (0 to 1000 units per second)
C. current cycles per second of air processing (0.3 to 3.0 cycles per second)
D. Current energy used (which is the above current cycles per second * 100 and then squared)
E. Current Oxygen added to air per cycle (0 to 100 units)
(possible outputs fed back in as inputs?):
F. previous change to cycles per second (+ or - 0.0 to 0.1 cycles per second)
G. previous cycles O2 units added per cycle (from 0 to 100 units per cycle)
H. previous change to current occupancy maximum (0 to 100 persons)
Here are the actions (outputs) my program can take:
Change cycles per second by increment/decrement of (0.0 to 0.1 cycles per second)
Change O2 units added per cycle (from 0 to 100 units per cycle)
Change current occupancy maximum (0 to 100 persons) - (basically allowing for forced occupancy reduction and then allowing it to normalize back to maximum)
The GOALS of the program are to maintain a homeostasis of :
as close to 100,000 units of O2 in room
do not allow room to drop to 0 units of O2 ever.
allows for current occupancy of up to 100 people per room for as long as possible without forcibly removing people (as O2 in room is depleted over time and nears 0 units people should be removed from room down to minimum and then allow maximum to recover back up to 100 as more and more 02 is added back to room)
and ideally use the minimum energy (watts) needed to maintain above two conditions. For instance if the room was down to 90,000 units of O2 and there are currently 10 people in the room (using 100 units per second of 02), then instead of running at 3.0 cycles per second (90 kw) and 100 units per second to replenish 300 units per second total (a surplus of 200 units over the 100 being consumed) over 50 seconds to replenish the deficit of 10,000 units for a total of 4500 kw used. - it would be more ideal to run at say 2.0 cycle per second (40 kw) which would produce 200 units per second (a surplus of 100 units over consumed units) for 100 seconds to replenish the deficit of 10,000 units and use a total of 4000 kw used.
NOTE: occupancy may fluctuate from second to second based on external factors that can not be controlled (lets say people are coming and going into the room at liberty). The only control the system has is to forcibly remove people from the room and/or prevent new people from coming into the room by changing the max capacity permitted at that next cycle in time (lets just say the system could do this). We don't want the system to impose a permanent reduction in capacity just because it can only support outputting enough O2 per second for 30 people running at full power. We have a large volume of available O2 and it would take a while before that was depleted to dangerous levels and would require the system to forcibly reduce capacity.
My question:
Can someone explain to me how I might configure this neural network so it can learn from each action (Cycle) it takes by monitoring for the desired results. My challenge here is that most articles I find on the topic assume that you know the correct output answer (ie: I know A, B, C, D, E inputs all are a specific value then Output 1 should be to increase by 0.1 cycles per second).
But what I want is to meet the conditions I laid out in the GOALS above. So each time the program does a cycle and lets say it decides to try increasing the cycles per second and the result is that available O2 is either declining by a lower amount than it was the previous cycle or it is now increasing back towards 100,000, then that output could be considered more correct than reducing cycles per second or maintaining current cycles per second. I am simplifying here since there are multiple variables that would create the "ideal" outcome - but I think I made the point of what I am after.
Code:
For this test exercise I am using a Swift library called Swift-AI (specifically the NeuralNet module of it : https://github.com/Swift-AI/NeuralNet
So if you want to tailor you response in relation to that library it would be helpful but not required. I am more just looking for the logic of how to setup the network and then configure it to do initial and iterative re-training of itself based on those conditions I listed above. I would assume at some point after enough cycles and different conditions it would have the appropriate weightings setup to handle any future condition and re-training would become less and less impactful.

This is a control problem, not a prediction problem, so you cannot just use a supervised learning algorithm. (As you noticed, you have no target values for learning directly via backpropagation.) You can still use a neural network (if you really insist). Have a look at reinforcement learning. But if you already know what happens to the oxygen level when you take an action like forcing people out, why would you learn such a simple facts by millions of evaluations with trial and error, instead of encoding it into a model?
I suggest to look at model predictive control. If nothing else, you should study how the problem is framed there. Or maybe even just plain old PID control. It seems really easy to make a good dynamical model of this process with few state variables.
You may have a few unknown parameters in that model that you need to learn "online". But a simple PID controller can already tolerate and compensate some amount of uncertainty. And it is much easier to fine-tune a few parameters than to learn the general cause-effect structure from scratch. It can be done, but it involves trying all possible actions. For all your algorithm knows, the best action might be to reduce the number of oxygen consumers to zero permanently by killing them, and then get a huge reward for maintaining the oxygen level with little energy. When the algorithm knows nothing about the problem, it will have to try everything out to discover the effect.

How should i interpret this grafana visualized prometheus histogram buckets heatmap?

I visualized prometheus histogram buckets as heatmap with grafana, below pic shows the query and the outcome graph, how should i interpret this?
According to my attacker, in total i sent 300 requests in that period exactly, but when i sum those numbers up on above graph i can never get exact 300,
and also looks those numbers are fluctuating with the time elapsing, how should i interpret this graph in a meaningful way?
And if i want those numbers to be the exact request counts locate in each of those bucket in that time window, what should i do?
Oh, for the X-Axis Mode i chose Series and the Value i chose Current.

There are real reasons why you can't always get a precise rate/increase value out of Prometheus. One of them is failed scrapes, i.e. every now and then a scrape will fail or time out due to a slow service, slow Prometheus or network issue.
The other reason is the fact that collected samples are never exactly scrape_interval apart: there will always be a few milliseconds or seconds of delay here and there. So (to take an extreme example) how can you tell the precise increase over the past 1 minute if you only have 2 samples 63 seconds apart? Is it the difference between the two values? Is it that difference adjusted to 60 seconds (i.e. / 63 * 60)?
That being said, Prometheus further boxes itself into a corner by only looking at samples falling strictly within the requested time range. To explain myself: how would a reasonable person calculate the increase of a counter over the last 30 minutes? They would likely take the value of said counter now and the value 30 minutes ago and subtract them. I.e. in PromQL terms (adjusting for counter resets where necessary):
request_duration_bucket - request_duration_bucket offset 30m
What Prometheus does instead (assuming a scrape_interval of 1m and an ideal timeseries with samples spaced exactly 1m apart) is essentially this:
(request_duration_bucket - request_duration_bucket offset 29m) / 29 * 30
I.e. it takes the increase over 29 minutes and extrapolates it to 30. Because of self-imposed limitations, nothing to do with the nature of the problem at hand.
Note that this works fine with counters that increase smoothly and continuously. E.g. if you have a counter that increases by 500 every minute, then taking the increase over 29 minutes and extrapolating to 30 is exactly correct. But for anything that increases in jumps and fits (which is most real-life counters) it will either slightly overestimate the increase if it occurs during the 29 minutes it actually samples (by exactly 1/29) or seriously underestimate it (if the increase occurs in the 1 minute not included in the sampling). This is even worse if you compute a rate/increase over a range covering fewer samples. E.g. if your range only covers 5 samples on average, the overestimate will be 20%, i.e. 1 / (5 - 1) and (each of) your increases will totally disappear 1 minute out of 5.
The only way I've found to work around this limitation is (again, assuming a scrape_interval of 1m) to reverse engineer Prometheus' extrapolation:
increase(request_duration_bucket[31m]) / 31 * 30
But this requires you to be aware of your scrape_interval and adjust for it and is very brittle (if you ever change your scrape_interval all your careful tweaking goes to hell).
Or, if you are OK with your increase falling to zero every time an instance is restarted:
clamp_min(request_duration_bucket - request_duration_bucket offset 30m, 0)
I do actually have a proposed patch to Prometheus to add xrate/xincrease functions that actually behave more as you would expect them to (and as described above) but it doesn't look very likely to be accepted: https://github.com/prometheus/prometheus/issues/3806

Alert in RAM/CPU Usage Detection in e-Commerce Server

Currently I'm building my monitoring services for my e-commerce Server, which mostly focus on CPU/RAM usage. It's likely Anomaly Detection on Timeseries data.
My approach is building LSTM Neural Network to predict next CPU/RAM value on chart trending and compare with STD (standard deviation) value multiply with some number (currently is 10)
But in real life conditions, it depends on many differents conditions, such as:
1- Maintainance Time (in this time "anomaly" is not "anomaly")
2- Sales time in day-off events, holidays, etc., RAM/CPU usages increase is normal, of courses
3- If percentages of CPU/RAM decrement are the same over 3 observations: 5 mins, 10 mins & 15 mins -> Anomaly. But if 5 mins decreased 50%, but 10 mins it didn't decrease too much (-5% ~ +5%) -> Not an "anomaly".
Currently I detect anomaly on formular likes this:
isAlert = (Diff5m >= 10 && Diff10m >= 15 && Diff30m >= 40)
where Diff is Different Percentage in Absolute value.
Unfortunately I don't save my "pure" data for building neural network, for example, when it detects anomaly, I modified that it is not an anomaly anymore.
I would like to add some attributes to my input for model, such as isMaintenance, isPromotion, isHoliday, etc. but sometimes it leads to overfitting.
I also want to my NN can adjust baseline over the time, for example, when my Service is more popular, etc.
There are any hints on these aims?
Thanks

I would say that an anomaly is an unusual outcome, i.e. a outcome that's not expected given the inputs. As you've figured out, there are a few variables that are expected to influence CPU and RAM usage. So why not feed those to the network? That's the whole point of Machine Learning. Your network will make a prediction of CPU usage, taking into account the sales volume, whether there is (or was) a maintenance window, etc.
Note that you probably don't need an isPromotion input if you include actual sales volumes. The former is a discrete input, and only captures a fraction of the information present in the totalSales input
Machine Learning definitely needs data. If you threw that away, you'll have to restart capturing it. As for adjusting the baseline, you can achieve that by overweighting recent input data.

Calculating IV60, and IV90 on interactive brokers

I am trading options, but I need to calculate the historical implied volatility in the last year. I am using Interactive Broker's TWS. Unfortunately they only calculate V30 (the implied volatility of the stock using options that will expire in 30 days). I need to calculate the implied volatility of the stock using options that will expire in 60 days, and 90 days.
The problem: Calculate the implied volatility of at least a whole year of an individual stock using options that will expire in 60 days and 90 days giving that:
TWS does not provide V60 or V90.
TWS does not provide historical pricing data for individual options for more than 3 months.
The attempted solution:
Use the V30 that TWS provide too come up with V60 and V90 giving the fact that usually option prices will behave like a skew (horizontal skew). However, the problem to this attempted solution is that the skew does not always have a positive slope, so I can't come up with a mathematical solution to always correctly estimate IV60 and IV90 as this can have a positive or negative slope like in the picture below.
Any ideas?

Your question is either confusing or isn't about programming. This is what IB says.
The IB 30-day volatility is the at-market volatility estimated for a
maturity thirty calendar days forward of the current trading day, and
is based on option prices from two consecutive expiration months.
It makes no sense to me and I can't even get those ticks to arrive (generic type 24). But even if you get them, they don't seem to be useful. My guess is it's an average to estimate what the IV would be for an option expiring exactly 30 days in the future. I can't imagine the purpose for this. The data would be impossible to trade with and doesn't represent reality. Imagine an earnings report at 29 or 31 days!
If you'd like the IV about 60 or 90 days in the future call reqMktData with an option contract that expires around then and an empty generic tick list. You will get tick types 10, 11, 12, and 13 which all have an IV. That's how you build the IV surface. If you'd like to build it with a weighted average to estimate 60 days, it's possible.
This is python but should be self explanatory
tickerId = 1
optCont = Contract()
optCont.m_localSymbol = "AAPL 170120C00130000"
optCont.m_exchange = "SMART"
optCont.m_currency = "USD"
optCont.m_secType = "OPT"
tws.reqMktData(tickerId, optCont, "", False)
Then I get data like
<tickOptionComputation tickerId=1, field=10, impliedVol=0.20363398519176756, delta=0.0186015418248492, optPrice=0.03999999910593033, pvDividend=0.0, gamma=0.007611155331932943, vega=0.012855970569816431, theta=-0.005936076573849303, undPrice=116.735001>
If there's something I'm missing about options, you should ask this at https://quant.stackexchange.com/

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse