How to turn the Prometheus irate function into SQL - PostgreSQL

I need to turn the Prometheus irate function into SQL, and I cannot really find the calculation logic documented anywhere.
I have the following PromQL query:
100 - (avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[40s])) * 100)
Let's say I have the following data for one CPU:
v:   20    50    100   200   201   230
-----x--+--x-----x-----x-----x--+--x-----
t:   10    20    30    40    50    60
         |<----- range = 40s ---->|
My question is not really related to Postgres, since I could solve this problem in SQL if I knew the formula I need to implement.
I understand that I have to take the difference of the last two data points and divide value_diff by time_diff:
(201-200)/(50-40), but how does the 40s window come into the picture?
((201-200)/(50-40))/40 ?
What would be the proper mathematical calculation for the above Prometheus query?
And how should I do the same if I have data for 8 CPUs?
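For context, this is roughly the SQL shape I have been experimenting with (untested, and the table samples(ts, instance, cpu, value), holding one row per scrape of the idle counter, is just my own naming):

WITH deltas AS (
  SELECT instance, cpu, ts, value,
         LAG(ts)    OVER (PARTITION BY instance, cpu ORDER BY ts) AS prev_ts,
         LAG(value) OVER (PARTITION BY instance, cpu ORDER BY ts) AS prev_value,
         ROW_NUMBER() OVER (PARTITION BY instance, cpu ORDER BY ts DESC) AS rn
  FROM samples
  WHERE ts > now() - interval '40 seconds'  -- my guess at how the [40s] range selector applies
)
SELECT instance,
       100 - avg((value - prev_value) / extract(epoch FROM ts - prev_ts)) * 100 AS busy_pct
FROM deltas
WHERE rn = 1 AND prev_ts IS NOT NULL        -- newest pair of samples per cpu
GROUP BY instance;

If irate really just divides the delta of the last two samples in the window by their time difference, something like this should reproduce it (the avg across the cpu rows would cover the 8-CPU case), but I am not sure that reading is correct.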
I tried to search for documentation, but could not find any proper explanation of what is going on behind the scenes.
Thanks

Related

Tableau - Calculated field of different columns based on different partition of the same table

Sorry for the stupid question.
Situation: I have a partitioned table (the partition is the week of the year) with some metrics (e.g. frequency of some keywords); I need to run an analysis of metrics belonging to different partitions (e.g. the trend between the frequency of a keyword in week 32 compared to week 3). The ultimate purpose is to create a dashboard where the user can choose the week of the year and is presented with the calculated analysis on the go.
So far I have used a live query that uses two parameters (week_1 and week_2) and joins data from the same table based on the two different parameters. You can imagine that the dashboard recomputes everything once one of the parameters is changed by the user. To avoid long waiting times, I have set the two parameters to a non-existent default value (0, zero), so that the dashboard can open very quickly. Then I prompt the user to stop the dashboard, insert the new parameters of choice, and then restart the dashboard to load the new computations.
My question is: is it possible to achieve the same by using an extract of the table? The table itself should not be excessively big (it should be 15 million records spanning 3 years) and as far as I know the extracts are performant with those numbers.
I am quite new to Tableau, so I would like to know from more expert people if there is a more optimal way to do such a thing without using live queries.
Please, feel free to ask more information if I was not clear! However, I cannot share my workbook, as it contains sensitive information.
Edit:
+-----------+---------+-----------+
| partition | keyword | frequency |
+-----------+---------+-----------+
| 202032    | hello   | 5000      |
| 202032    | ciao    | 567       |
| ...       | ...     | ...       |
| 202031    | hello   | 2323      |
| 202031    | ciao    | 34567     |
| ...       | ...     | ...       |
| 20203     | hello   | 2         |
| 20203     | ciao    | 1000      |
+-----------+---------+-----------+
With the live query, I can join the table where partition = 202032 with the same table where partition = 20203 and make a new table with a column where I compute e.g. a trend between the two frequencies:
+---------+---------------------+-------------+
| keyword | partitions_compared | trend       |
+---------+---------------------+-------------+
| hello   | 202032 - 20203      | +1billion % |
| ciao    | 202032 - 20203      | +1K %       |
+---------+---------------------+-------------+
With the live query I join on the keywords.
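In rough SQL terms, the live query does something like this (only a sketch - the real table and column names differ, and week_1/week_2 stand for the two dashboard parameters):

SELECT a.keyword,
       a.partition || ' - ' || b.partition AS partitions_compared,
       100.0 * (a.frequency - b.frequency) / b.frequency AS trend
FROM metrics a
JOIN metrics b ON b.keyword = a.keyword
WHERE a.partition = <week_1>
  AND b.partition = <week_2>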
Thanks a lot in advance and have a great day!
Cheers

Selecting multiple values/aggregators from InfluxDB, excluding time

I have an InfluxDB measurement consisting of
> SELECT * FROM results
name: results
time                artnum duration
----                ------ --------
1539084104865933709 1234   34
1539084151822395648 1234   81
1539084449707598963 2345   56
1539084449707598123 2345   52
and other tags. Both artnum and duration are fields (though that is changeable). I'm now trying to create a query (to use in Grafana) that gives me the following result, with a calculated mean() and the number of measurements for that artnum:
artnum mean_duration no. measurements
------ ------------- ----------------
1234   58            2
2345   54            2
First of all: is it possible to exclude the time column? Secondly, what is the InfluxDB way to create such a table? I started with
SELECT mean("duration"), "artnum" FROM "results"
resulting in ERR: mixing aggregate and non-aggregate queries is not supported. Then I found https://docs.influxdata.com/influxdb/v1.6/guides/downsampling_and_retention/, which looked like what I wanted to do. I then created an infinite retention policy (duration 0s) and a continuous query:
> CREATE CONTINUOUS QUERY "cq" ON "test" BEGIN
SELECT mean("duration"),"artnum"
INTO infinite.mean_duration
FROM infinite.test
GROUP BY time(1m)
END
I followed the instructions, but after I fed some data to the db and waited for 1m, SELECT * FROM "infinite"."mean_duration" did not return anything.
Is that approach the right one or should I continue somewhere else? The very goal is to see the updated table in grafana, refreshing once a minute.
InfluxDB is a time series database, so you really need the time dimension - also in the response. You will have a hard time with Grafana if your query returns non-time-series data. So don't try to remove time from the query. The better option is to hide time in the Grafana table panel - use column styles and set Type: Hidden.
InfluxDB doesn't have tables, but measurements. I guess you need only a query with proper grouping, no advanced continuous queries, etc. Try and improve this query*:
SELECT
  MEAN("duration"),
  COUNT("duration")
FROM results
GROUP BY "artnum" fill(null)
*you may have a problem with grouping in your case, because artnum is an InfluxDB field - the better option is to save artnum as an InfluxDB tag.

Spark window functions: how to implement complex logic with good performance and without looping

I have a data set that lends itself to window functions: 3M+ rows that, once ranked, can be partitioned into groups of ~20 or fewer rows. Here is a simplified example:
id  date1    date2    type          rank
171 20090601 20090601 attempt       1
171 20090701 20100331 trial_fail    2
171 20090901 20091101 attempt       3
171 20091101 20100201 attempt       4
171 20091201 20100401 attempt       5
171 20090601 20090601 fail          6
188 20100701 20100715 trial_fail    1
188 20100716 20100730 trial_success 2
188 20100731 20100814 trial_fail    3
188 20100901 20100901 attempt       4
188 20101001 20101001 success       5
The data is ranked by id and date1, and the window is created with:
Window.partitionBy("id").orderBy("rank")
In this example the data has already been ranked by (id, date1). I could also work on the unranked data and rank it within Spark.
I need to implement some logic on these rows, for example, within a window:
1) Identify all rows that end during a failed trial (i.e. a row's date2 is between date1 and date2 of any previous row within the same window of type "trial_fail").
2) Identify all trials after a failed trial (i.e. any row with type "trial_fail" or "trial_success" after a row within the same window of type "trial_fail").
3) Identify all attempts before a successful attempt (i.e. any row with type "attempt" with date1 earlier than date1 of another later row of type "success").
The exact logic of these conditions is not important to my question (and there will be other different conditions), what's important is that the logic depends on values in many rows in the window at once. This can't be handled by the simple Spark SQL functions like first, last, lag, lead, etc. and isn't as simple as the typical example of finding the largest/smallest 1 or n rows in the window.
What's also important is that the partitions don't depend on one another, so this seems like a great candidate for Spark to do in parallel: 3 million rows with 150,000 partitions of 20 rows each. In fact, I wonder if this is too many partitions.
I can implement this with a loop something like (in pseudocode):
for i in 1..20:
    for j in 1..20:
        // compare window[j]'s type and dates to window[i]'s etc.
        // add a Y/N flag to the DF to identify target rows
This would require 400+ iterations (the choice of 20 for the max i and j is an educated guess based on the data set and could actually be larger), which seems needlessly brute force.
However, I am at a loss for a better way to implement it. I think this approach would essentially collect() in the driver, which I suppose might be OK if it is not much data. I thought of trying to implement the logic as sub-queries, or by creating a series of sub-DFs, each with a subset or reduction of the data.
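One idea I have been toying with (an untested sketch, using the column names from the example above) is to gather each small group into an array column with collect_list and evaluate the cross-row conditions in a UDF, since each group is only ~20 rows:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// with no orderBy the frame is the whole partition, which is what we want here
val w = Window.partitionBy("id")

// condition 1: this row's date2 falls inside an earlier trial_fail row
val endsInFailedTrial = udf { (rank: Int, date2: Int, rows: Seq[Row]) =>
  rows.exists { r =>
    r.getAs[Int]("rank") < rank &&
    r.getAs[String]("type") == "trial_fail" &&
    date2 >= r.getAs[Int]("date1") && date2 <= r.getAs[Int]("date2")
  }
}

val flagged = df
  .withColumn("grp", collect_list(struct("rank", "date1", "date2", "type")).over(w))
  .withColumn("ends_in_failed_trial", endsInFailedTrial(col("rank"), col("date2"), col("grp")))
  .drop("grp")

I don't know whether materializing the ~20-row array on every row is acceptable overhead, or whether there is a cleaner API for this kind of whole-group logic.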
If anyone is aware of any APIs or techniques that I am missing, any info would be appreciated.
Edit: This is somewhat related:
Spark SQL window function with complex condition

BIRT Mathematics on data set

I am pretty new to BIRT reporting. I can make simple charts and tables, but when it comes to calculations on values from data sets, I have no clue which direction to look in.
For example I have this simple data set:
count1 count2 max type  length
616    3858   21  STEEL 20
723    4432   14  STEEL 40
854    5869   21  ALL   20
838    5225   14  ALL   40
And I would like to have BIRT calculate approximately this:
SUM(count2)/SUM(count1) WHERE type=ALL
so this:
((5869+5225)/(854+838))
My question would be how I could get there. At this point I think I just need a pointer in the right direction for how these kinds of operations can be done.
Thanks in advance.
I presume you want to display this value somewhere on the report, like at the bottom of a table where this data is presented. If this is the case, then add an aggregation data element to the footer of the table where the data is displayed. In the aggregation data element you will have the opportunity to specify an expression SUM(count2) and a filter condition [type=ALL]. You can then do the division in another data element.
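As a sketch (the row["..."] names are assumptions - they depend on your actual column bindings), the same value as a single computed expression using BIRT's Total functions with their optional filter argument would look like:

Total.sum(row["count2"], row["type"] == "ALL") / Total.sum(row["count1"], row["type"] == "ALL")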
If you want to simply compute the value you described, you could do it with SQL, i.e. SELECT SUM(count2)/SUM(count1) FROM myTable WHERE type='ALL'. You would have to provide the data to the report as a data source, which could be in the form of an Excel spreadsheet, *.csv file, database connection, etc.
So there are a few ways to do this depending on what your requirements are.

Cannot allocate memory for a column of compound floats on a partitioned table

I have a partitioned table in my hdb that includes a column containing large lists of floats (at most 400 floats per element), e.g. each element looks like
(100.0 1.0 ...)
When trying to select on this column from days where there are particularly high numbers of rows, I get an error saying
'./2015.02.07/table/column# Cannot allocate memory
The same error arises from a query like:
select column[;0] from table where date=2015.02.07
even though on days with fewer rows this query returns the first value of each element in the column.
Is there a way to stream this column in a select to decrease the memory requirements of holding the whole column in memory for a large day?
EDIT
.Q.ind on large days fails with the same error.
i.e. given that I can work with 2015.02.01 but not 2015.02.02:
.Q.ind[select from table where date=2015.02.01;enlist 1]
is fine but
.Q.ind[select from table where date=2015.02.02;enlist 1]
fails with
{0!$[#.Q.pm;p3;(?).]#[x;0;p1[;y;z]]}
'./2015.02.10/table/column2#: Cannot allocate memory
#
.[?]
(+`time`sym`column1`column2!`:./2015.02.02/table;();0b;())
I should note I am using the free 32-bit version.
I think this is all just a combination of the free-32bit memory limitation, the fact that your row counts are possibly large, and the fact that (unavoidably) something must be pulled entirely into memory when retrieving data from a column - whether it is the column itself that gets entirely pulled in (in the non-nested case) or the nested-index column that gets entirely pulled in.
Another thing to consider is that kdb uses powers-of-two (buddy) memory allocation. Even if today's table only contains one more row than yesterday's, the memory requirements per column could double. Take a simple example:
In the free 32-bit version (Windows) you can create this many floats, and it only uses ~1.07GB of memory:
q)\ts 134217726?1.0
3093 1073741952
However, try to generate one extra float and you hit a memory limit:
q)\ts 134217727?1.0
wsfull
So even a small difference in row count between one day and the next can be very significant if you're near the boundary of an allocatable power of two.
--DISCLAIMER-- the following is hacky and is only intended for debugging!
You can actually manually try to access the data from the nested list, though you may still have memory issues here anyway.
Create a nested table and splay it
q)tab:([] col1:(101 102 103f;104 105f;106 107 108 109 110f;111 112f))
q)tab
col1
--------------------
101 102 103f
104 105f
106 107 108 109 110f
111 112f
q)
q)`:test/ set tab
`:test/
You can try to read in the indices from the nested-index file
q)2_first (enlist "j";enlist 8)1:`:test/col1
3 5 10 12
So the indices for splitting the full list of floats (the col1# file) are index 3, index 5, 10, etc.
Say I want the first 3 rows:
q)myrows:3#2_first (enlist "j";enlist 8)1:`:test/col1
q)myrows
3 5 10
Then I know that I need the first 10 floats from the col1# file, split at indices 3 and 5. So I can read the col1# file partially and split it correctly:
q)(0,-1_myrows) cut raze (enlist "f";enlist 8)1:(`$":test/col1#";0;8*last myrows)
101 102 103f
104 105f
106 107 108 109 110f
But this is precisely what kdb does under the covers anyway, so I suspect that you'll still have trouble even reading in the nested-index file in the first place.
Check this debug/hack and see if you can partially read that way. But obviously it's not a long-term solution!
Nested columns make querying in the usual way difficult, as the # file also needs to be loaded into memory (even with a [;0]).
Your best bet is to map a date partition in with a select, and then query within that date chunk by chunk, e.g. a million rows at a time (or whatever is sensible given the size of the nested floats).
Perhaps also consider 32-bit floats (the real type), if some decimal accuracy can be sacrificed.
EDIT
So after the comments, I guess the best way is to go through each partition a number of rows at a time with .Q.ind.
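Something along these lines, for example (an untested sketch - the chunk size is arbitrary, names are as in the question, and the 32-bit limits may still bite when mapping the nested-index file):

n:count table                    / total rows across all date partitions
c:100000                         / rows per chunk
r:raze {select column[;0] from .Q.ind[table;x+til c&n-x]} each c*til ceiling n%c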
Just to give my 2 cents on this: I had a similar error, but with a 64-bit instance.
I suspected that the memory needed to be de-fragmented, as the instance had been running for almost a year.
Bouncing the instance solved the issue and released a lot of virtual memory.