How does the Graphite summarize function with avg work?

I'm trying to figure out how the Graphite summarize function works. I have the following data points, where the X-axis represents time and the Y-axis represents duration in ms.
+-------+------+
| X | Y |
+-------+------+
| 10:20 | 0 |
| 10:30 | 1585 |
| 10:40 | 356 |
| 10:50 | 0 |
+-------+------+
When I pick any time window in Grafana of 2 hours or more (why?) and apply summarize('1h', avg, false), I get a triangle starting at (9:00, 0) and ending at (11:00, 0), with the peak at (10:00, 324).
A formula that a colleague came up with to explain the above observation is as follows.
Let:
a = Number of data points for a peak, in this case 4.
b = Number of non-zero data points, in this case 2.
Then avg = sum / (a + b). It produces (1585+356) / 6 = 324 (rounded), but it doesn't match the definition of any mean I know of. What is the math behind this?

Your data is at 10-minute intervals, so there are 6 points in each 1-hour period. Graphite simply takes the sum of the non-null values in each period and divides it by their count (a standard average). If you look at the raw series, you'll likely find that there are also zero values at 10:00 and 10:10, so the 10:00 bucket is (0 + 0 + 0 + 1585 + 356 + 0) / 6 = 323.5, which rounds to 324.
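As a rough check of that arithmetic, here is a minimal Python sketch of a per-bucket average that skips nulls. It is not Graphite's actual code: `bucket_average` is a hypothetical helper, and the zeros at 10:00 and 10:10 are the assumption described in the answer, not values taken from the question's table.

```python
def bucket_average(values):
    """Average the non-null values of one bucket; None if the bucket is empty."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

# 10:00-10:50 bucket, assuming the raw series also holds zeros at 10:00 and 10:10.
with_zeros = [0, 0, 0, 1585, 356, 0]
# Same bucket if those two points were missing (null) instead of zero.
with_nulls = [None, None, 0, 1585, 356, 0]

print(bucket_average(with_zeros))  # 323.5
print(bucket_average(with_nulls))  # 485.25
```

If the two early points were null rather than zero, the divisor would be 4 instead of 6, which is why the observed peak of about 324 suggests that real zero values are present.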


Compute similarity in pyspark

I have a CSV file containing some data, and I want to select the rows that are similar to a given input.
My data looks like this:
H1 | H2 | H3
--------+---------+----------
A | 1 | 7
B | 5 | 3
C | 7 | 2
The data point I want to find similar rows for is: [6, 8].
That is, I want to find the rows whose H2 and H3 values are similar to the input, and return their H1 values.
I want to use PySpark with a similarity measure such as Euclidean distance, Manhattan distance or cosine similarity, or a machine learning algorithm.
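To make the three measures named above concrete, here is a minimal plain-Python sketch (not PySpark) applied to the sample rows and the input [6, 8]. The names `rows`, `query` and the helper functions are assumptions for illustration only; in PySpark the same arithmetic would have to be expressed over DataFrame columns.

```python
import math

# H1 -> (H2, H3), copied from the sample table in the question.
rows = {"A": (1, 7), "B": (5, 3), "C": (7, 2)}
query = (6, 8)

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.dist(p, q)

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_similarity(p, q):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.hypot(*p) * math.hypot(*q))

# Rank H1 labels from nearest to farthest by Manhattan distance.
by_manhattan = sorted(rows, key=lambda h1: manhattan(rows[h1], query))
# Pick the H1 label whose vector points most nearly in the query's direction.
most_cosine_similar = max(rows, key=lambda h1: cosine_similarity(rows[h1], query))

print(by_manhattan)
print(most_cosine_similar)
```

Note that the measures need not agree: distances favour rows that are close in value, while cosine similarity only looks at direction.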

Is there any way to compare two table calculations made in Tableau to create a calculated field?

I am relatively new to Tableau. I am working on a problem that requires me to compare a table calculation to specified thresholds. I have five time windows to categorize my data into: 0-30, 30-60, 60-90, 90-120 and 120 onwards. This is spread across the data. I calculate the number of events that happen within each time window with '{FIXED [time window] : count([time window])}'. This gives me a count for every category: 50 events lasted 0-30s, 30 events lasted 30-60s, 10 events lasted 60-90s, and 5 events each for the remaining two classes. The cumulative percentages I have to meet are 75, 90, 95, 97.5 and 100.
I created this using IF, ELSEIF and ELSE statements:
IF [time window] = '0-30s' THEN 75
ELSEIF [time window] = '30-60s' THEN 90
ELSEIF [time window] = '60-90s' THEN 95
ELSEIF [time window] = '90-120s' THEN 97.5
ELSE 100
END
and named it specified cumulative share.
I then make a table calculation on the obtained values, the percent of total of the running total of the event count in each class, using primary and secondary table calculations for the measure. This gives 50%, 80%, 90%, 95% and 100% for the respective classes. Now I need to compare each of them with the specified share and create another calculated field that says greater than, equal to or less than. How do I do it?
The current table looks like this:
**Time window** | **Obtained cumulative share** | **Specified cumulative share**
0 - 30 s | 50 % | 75
30 - 60 s | 80 % | 90
60 - 90 s | 90 % | 95
90 - 120 s | 95 % | 97.5
120 onwards | 100 % | 100
**Obtained cumulative share** is an alias for percent of total (running total (counts for each class))
I created some sample data and did it like this.
Instead of calculating the cumulative sum through the table calculation method, use the RUNNING_SUM function:
RUNNING_SUM(SUM([Count of Class] ))
I named this field calculated cum sum.
Then create another calculated field for your T/F condition:
MIN([specified Cum Share])>=([calculated cum share])
I tweaked your specified shares just to check that the formula is correct. This view shows that it works.
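The comparison the question asks for (greater than, equal to or less than) can be checked outside Tableau with a small Python sketch. It uses the counts and specified shares from the question; `compare` and the list names are hypothetical, and this only illustrates the arithmetic, not Tableau's calculation engine.

```python
from itertools import accumulate

windows = ["0-30s", "30-60s", "60-90s", "90-120s", "120 onwards"]
counts = [50, 30, 10, 5, 5]            # events per time window (from the question)
specified = [75, 90, 95, 97.5, 100]    # specified cumulative share, in percent

# Percent of total of the running total, i.e. the obtained cumulative share.
total = sum(counts)
obtained = [100 * running / total for running in accumulate(counts)]

def compare(obtained_share, specified_share):
    """Label the obtained share relative to the specified share."""
    if obtained_share > specified_share:
        return "greater than"
    if obtained_share < specified_share:
        return "less than"
    return "equal to"

labels = [compare(o, s) for o, s in zip(obtained, specified)]
for window, o, s, label in zip(windows, obtained, specified, labels):
    print(window, o, s, label)
```

With the numbers from the question, the first four windows come out as "less than" and the last one as "equal to".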

matlab spectrum returns more FRAME than expected

I'm using the following code to get specgram2D from a NumPy array:
from matplotlib import mlab
specgram2D, freq, time = mlab.specgram(samples, Fs=11025, NFFT=1024, window=mlab.window_hanning, noverlap=int(1024 * 0.5))
Then I print out the length of specgram2D like this:
print(len(specgram2D))  # prints 513
I got 513 instead of the expected 512, which is half the window size.
What am I doing wrong?
Can I just ignore specgram2D[512]?
> I got 513 instead of expected 512 which is half the window size.
> What am I doing wrong?
For a real-valued signal, the frequency spectrum obtained from the Discrete Fourier Transform (DFT) is symmetric and hence only half of the spectrum is necessary to describe the entire spectrum (since the other half can be obtained from symmetry). That is probably why you are expecting the size to be exactly half the input window size of 1024.
The problem is that with even-sized inputs, the midpoint of the spectrum falls exactly on a frequency bin. As a result, that frequency bin is its own mirror image. The symmetry can be seen in the following diagram:
frequency:   0   fs/N  ...          fs/2          ...            fs
bin number:  0    1    ...   511    512    513    ...   1023    1024
             ^    ^           ^     ^ ^     ^            ^       ^
             |    |           |     |-|     |            |       |
             |    |           |             |            |       |
             |    |           |-------------|            |       |
             |    |                                      |       |
             |    |--------------------------------------|       |
             |                                                   |
             |---------------------------------------------------|
Here N is the size of the FFT (as determined by the NFFT=1024 parameter) and fs is the sampling frequency. As you can see, the spectrum is fully specified by taking bins 0 to 512, inclusive. Correspondingly, you should expect the size to be floor(N/2)+1 (simply N/2 + 1 with integer division, but I included the floor to emphasize the rounding down), or 513 in your case.
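The symmetry argument and the floor(N/2)+1 count can be checked with a small, self-contained Python sketch. It uses a naive DFT on an arbitrary 8-sample real signal rather than mlab.specgram; `dft` and `one_sided_bins` are hypothetical helper names, and the sample values are made up for illustration.

```python
import cmath

def dft(x):
    """Naive O(N^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def one_sided_bins(nfft):
    """Number of bins needed to describe the spectrum of a real signal."""
    return nfft // 2 + 1

signal = [0.0, 1.0, 3.0, -2.0, 0.5, 4.0, -1.0, 2.0]  # arbitrary real samples, N = 8
spectrum = dft(signal)
n = len(signal)

# For real input, bin k is the complex conjugate of bin N-k,
# so the bins above N/2 carry no new information.
mirrored = all(abs(spectrum[k] - spectrum[n - k].conjugate()) < 1e-9
               for k in range(1, n))

print(mirrored)
print(one_sided_bins(n))     # 5 bins: 0..4, with bin 4 as its own mirror
print(one_sided_bins(1024))  # 513
```

For N = 8 the middle bin (4) pairs with itself, which is the small-scale version of bin 512 in the 1024-point case.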
> Can I just ignore specgram2D[512]?
As shown above, it is an integral part of the spectrum, but many applications do not specifically require every single frequency bin (i.e. whether you can ignore that bin depends on whether your application is mostly interested in other frequency components).

Simulation of custom probability distribution in Matlab

I'm trying to simulate the following distribution:
a | 0 | 1 | 7 | 11 | 13
-----------------------------------------
p(a) | 0.34 | 0.02 | 0.24 | 0.29 | 0.11
I already simulated a similar problem: four types of balls with probabilities of 0.3, 0.1, 0.4 and 0.2. I created a vector F = [0 0.3 0.4 0.8 1] and used repmat to replicate it into 1000 rows. Then I compared it with a column vector of 1000 random numbers, replicated into 5 columns using the same repmat approach. I compared those two, calculated the column sums of the resulting matrix, and took the differences to get the frequencies (e.g. [301 117 386 196]).
But with the current distribution I don't know how to create the initial matrix F and whether I can use the same approach I used before at all.
I need the answer to be "vectorised", so no (for, while or if) loops.
This question on math.stackexchange
What if you create the following arrays:
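largeNumber = 1000000;
% round() guards against non-integer sizes caused by floating-point products
a=repmat( [0], 1, round(largeNumber*0.34) );   % copies of 0
b=repmat( [1], 1, round(largeNumber*0.02) );   % copies of 1
% ...
e=repmat( [13], 1, round(largeNumber*0.11) );  % copies of 13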
Then you concatenate all of these arrays (to get a single array in which each value appears in proportion to its probability), shuffle the result, and take the first N elements to get a vector of N samples drawn from your distribution.
EDIT: of course this answer is the way to go.
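The cumulative-vector idea from the question also carries over to the new distribution directly. Below is a minimal Python sketch of it (inverse-CDF sampling), offered as an illustration rather than as the Matlab solution: it uses a plain loop, not a vectorised expression, and `draw`, the seed, and the sample size are assumptions.

```python
import bisect
import random
from collections import Counter
from itertools import accumulate

values = [0, 1, 7, 11, 13]
probs = [0.34, 0.02, 0.24, 0.29, 0.11]

# Cumulative probabilities, the analogue of the F vector without the leading 0.
cumulative = list(accumulate(probs))

def draw(n, rng):
    """Draw n samples by locating uniform random numbers in the cumulative vector."""
    samples = []
    for _ in range(n):
        u = rng.random()
        # min() guards against floating-point round-off in the last cumulative value.
        idx = min(bisect.bisect_right(cumulative, u), len(values) - 1)
        samples.append(values[idx])
    return samples

rng = random.Random(0)           # fixed seed, only to make the run repeatable
samples = draw(100_000, rng)
freq = Counter(samples)

for v, p in zip(values, probs):
    print(v, p, freq[v] / len(samples))
```

Each printed line shows a value, its target probability and the observed relative frequency, which should be close to each other for a large sample.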

gnuplot, two y-ranges far apart

Is it possible to plot two ranges that are far apart from each other?
I mean, if I have a dataset like [ 1, 2, 3, 1001, 1001, 1003 ],
can I draw a plot like this?
     |
1003 |           x
1002 |         x
1001 |       x
1000 |
     |
===================== omission
     |
   4 |
   3 |     x
   2 |   x
   1 | x
     -------------
You may want to check out this link:
Gnuplot surprising - Broken axes graph in gnuplot. The author presents three examples of plotting a graph with a broken x axis.
Helpful links:
http://gnuplot-surprising.blogspot.com/2011/10/broken-axes-graph-in-gnuplot-3.html
http://www.phyast.pitt.edu/~zov1/gnuplot/html/broken.html
http://www.phyast.pitt.edu/~zov1/
It is not straightforward.