CloudWatch Logs Insights: get average of count returned

I'm querying CloudWatch logs via Logs Insights queries and am trying to get the average of the counts returned. For example, my current query that gets the counts is (only relevant parts shown):
fields fieldA as A
| stats count(*) as countForEachA by A
returns this:
A       countForEachA
------  -------------
a123    22
a124    22
a125    16
I'm trying to get the average of field countForEachA. For the example above, that's (22 + 22 + 16) / 3 = 20 (the sum of countForEachA divided by the number of results). Here's what I've tried:
fields fieldA as A
| stats avg(count(*)) as average by A
The query above returns this:
A       average
------  -------
a123    2.3
a124    1.4
a125    2.9
while I expected a single value representing the average. Any help will be appreciated.

I got it: to get the average of countForEachA, we first need to get the count value and then simply call the avg() function, like so:
fields fieldA as A
| stats count(*) as countForEachA, avg(countForEachA)
Hope this helps others.
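If your version of Logs Insights supports chaining two stats commands (a later addition to the query language; worth verifying before relying on it), an untested variant that keeps the per-A counts out of the final result might look like:
fields fieldA as A
| stats count(*) as countForEachA by A
| stats avg(countForEachA) as average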

Related

Changing a functional qSQL query to involve multiple columns in a calculation (KDB+/Q)

I have a functional ? query like so:
t:([]Quantity:1 2 3;Price:4 5 6;date:2020.01.01 2020.01.02 2020.01.03);
?[t;enlist(within;`date;(2020.01.01,2020.01.02));0b;(enlist `Quantity)!enlist (sum;(`Quantity))]
to get me the sum of the Quantity in the given date range. I want to adjust this to get me the sum of the Notional in the date range; Quantity*Price. So the result should be (1x4)+(2x5)=14.
I tried things like the following
?[t;enlist(within;`date;(2020.01.01,2020.01.02));0b;(enlist `Quantity)!enlist (sum;(`Price*`Quantity))]
but couldn't get it to work. Any advice would be greatly appreciated!
In such a scenario I would advise thinking about the qSQL-style query you are looking for and then working from there.
So in this case I believe you are looking to do something like:
select sum Quantity*Price from t where date within 2020.01.01 2020.01.02
You can then run parse on this to break it into its functional form, i.e. the functional ? query you refer to.
q)parse"select sum Quantity*Price from t where date within 2020.01.01 2020.01.02"
?
`t
,,(within;`date;2020.01.01 2020.01.02)
0b
(,`Quantity)!,(sum;(*;`Quantity;`Price))
This is the functional form you need: table, where clause, by clause, and aggregation.
You can see that Quantity here is just the sum of the product of the two columns.
q)?[t;enlist(within;`date;(2020.01.01;2020.01.02));0b;enlist[`Quantity]!enlist(sum;(*;`Quantity;`Price))]
Quantity
--------
14
You could also extend this to change the columns as necessary and create a function for it too, if you so wish:
q)calcNtnl:{[sd;ed] ?[t;enlist(within;`date;(sd;ed));0b;enlist[`Quantity]!enlist(sum;(*;`Quantity;`Price))]}
q)calcNtnl[2020.01.01;2020.01.02]
Quantity
--------
14
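To change the columns as well, a minimal sketch (untested, assuming the same table t) that takes the two input columns as parameters and names the result Notional:
q)calcAgg:{[sd;ed;c1;c2] ?[t;enlist(within;`date;(sd;ed));0b;enlist[`Notional]!enlist(sum;(*;c1;c2))]}
q)calcAgg[2020.01.01;2020.01.02;`Quantity;`Price]
Notional
--------
14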

sort data in hdb by using dbmaint.q in kdb

I am trying to sort 1 or 2 columns in an hdb in kdb but failed. This is the code I have:
fncol[dbdir;`trade;`sym;xasc];
and got a length error when I called it. But I don't get a length error if I use this code:
fncol[dbdir;`trade;`sym;asc];
However, this only sorts the sym column itself. I want the data in the other columns to change according to the sym column as well.
In addition, I would like to apply the parted attribute to the sym column. I also tried to sort this way:
fncol[dbdir;`trade;`sym`ptime;xasc];
which also failed.
You should always be careful with dbmaint.q if you are unsure what it is going to do. I gather from the fact that asc worked after xasc that you are using a test hdb each time.
fncol should be used with unary functions, i.e. functions of 1 argument. Its use case is modifying individual columns. What you are trying to do is modify the entire table, since you want to sort the whole table relative to the sym column. Using .Q.dpft for each date is what you want, as outlined by Cathal in your follow-up question: using .Q.dpft function to resave table
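A minimal sketch of that per-date approach (untested; it assumes a date-partitioned hdb in the current directory, a fresh q session in which the hdb has NOT been loaded so the name `trade is free, and a backup taken first):
/ load the enumeration domain, then resave each date sorted
load `:sym
fix1:{[d]
  dir:` sv `:.,(`$string d),`trade;      / e.g. `:./2014.04.21/trade
  `trade set `sym`ptime xasc select from get dir;
  .Q.dpft[`:.;d;`sym;`trade]}            / resave splayed, sorted, `p# on sym
fix1 each 2014.04.21 2014.04.22          / one date at a time
.Q.dpft itself orders the rows by the sym column (iasc is stable, so the prior ptime ordering should survive within each sym) and applies the parted attribute, which should cover both of your requirements.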
When you run fncol[dbdir;`trade;`sym;xasc]; you are saving down a projection in place of the sym column in each date partition:
fncol[`:.;`trades;`sym;xasc];
select from trades where date = 2014.04.21
'length
[0] select from trades where date = 2014.04.21
q)get `:2014.04.21/trades/sym
k){$[$[#x;~`s=-2!(0!.Q.v y)x;0];.Q.ft[#[;*x;`s#]].Q.ord[<:;x]y;y]}[`p#`sym$`A..
// This is the k definition of xasc with the sym column as the first parameter.
q)xasc
k){$[$[#x;~`s=-2!(0!.Q.v y)x;0];.Q.ft[#[;*x;`s#]].Q.ord[<:;x]y;y]}
// Had you needed to fix your hdb: I managed to undo this using value and indexing into the sym column data.
fncol[`:.;`trades;`sym;{(value x)[1]}];
q)select from trades where date = 2014.04.21
date sym time src price size
------------------------------------------------------------
2014.04.21 AAPL 2014.04.21D08:00:12.155000000 N 25.31 2450
2014.04.21 AAPL 2014.04.21D08:00:42.186000000 N 25.32 289
2014.04.21 AAPL 2014.04.21D08:00:51.764000000 O 25.34 3167
asc will not break the hdb, as it takes just 1 argument and saves down ONLY the sym column in ascending order, not the whole table.
Is there any indication of what date is failing with a length error? It could be something wrong with one of the partitions.
Perhaps try loading one of the dates into memory and sorting it manually, i.e.
`sym xasc select from trade where date=last date
That might indicate whether a specific partition is causing issues.
FYI, if you're interested in applying the `p# attribute, you should try setattrcol in dbmaint.q. I think the data will need to be sorted first, though.
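For reference, a one-line sketch (assuming the same dbdir as above and that each partition is already sorted on sym):
setattrcol[dbdir;`trade;`sym;`p]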

Selecting multiple values/aggregators from Influxdb, excluding time

I have an InfluxDB measurement consisting of
> SELECT * FROM results
name: results
time artnum duration
---- ------ --------
1539084104865933709 1234 34
1539084151822395648 1234 81
1539084449707598963 2345 56
1539084449707598123 2345 52
and other tags. Both artnum and duration are fields (though that is changeable). I'm now trying to create a query (to use in Grafana) that gives me the following result, with a calculated mean() and the number of measurements for that artnum:
artnum  mean_duration  no. measurements
------  -------------  ----------------
1234    58             2
2345    54             2
First of all: is it possible to exclude the time column? Secondly, what is the InfluxDB way to create such a table? I started with
SELECT mean("duration"), "artnum" FROM "results"
which resulted in ERR: mixing aggregate and non-aggregate queries is not supported. Then I found https://docs.influxdata.com/influxdb/v1.6/guides/downsampling_and_retention/, which looked like what I wanted to do. I then created an infinite retention policy (duration 0s) and a continuous query:
> CREATE CONTINUOUS QUERY "cq" ON "test" BEGIN
SELECT mean("duration"),"artnum"
INTO infinite.mean_duration
FROM infinite.test
GROUP BY time(1m)
END
I followed the instructions, but after I fed some data to the db and waited for 1m, SELECT * FROM "infinite"."mean_duration" did not return anything.
Is that approach the right one, or should I continue somewhere else? The end goal is to see the updated table in Grafana, refreshing once a minute.
InfluxDB is a time series database, so you really need the time dimension, also in the response. You will have a hard time with Grafana if your query returns non-time-series data, so don't try to remove time from the query. The better option is to hide time in the Grafana table panel: use column styles and set Type: Hidden.
InfluxDB doesn't have tables, but measurements. I guess you only need a query with proper grouping, no advanced continuous queries, etc. Try and improve this query*:
SELECT
MEAN("duration"),
COUNT("duration")
FROM results
GROUP BY "artnum" fill(null)
*You may have a problem with grouping in your case, because artnum is an InfluxDB field and InfluxQL can group only by tags (and time); the better option is to save artnum as an InfluxDB tag.
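For example, in the line protocol a point with artnum written as a tag instead of a field would look like this (hypothetical sample point):
results,artnum=1234 duration=34 1539084104865933709
With artnum as a tag, the GROUP BY "artnum" above works natively.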

pentaho distinct count over date

I am currently working on Pentaho and I have the following problem:
I want to get a "rolling" distinct count on a value, which ignores the "group by" performed by Business Analytics. For instance:
Date Field
2013-01-01 A
2013-02-05 B
2013-02-06 A
2013-02-07 A
2013-03-02 C
2013-04-03 B
When I use a classical "distinct count" aggregator in my schema, sum it, and then add "month" to the columns, I get:
Month Count Sum
2013-01 1 1
2013-02 2 3
2013-03 1 4
2013-04 1 5
What I would like to get would be:
Month Sum
2013-01 1
2013-02 2
2013-03 3
2013-04 3
which is the distinct count of all Fields seen so far. Does anyone have any idea on this topic?
My database is PostgreSQL, and I'm looking for any solution under PDI, PSW, PBA or PME.
Thank you!
A naive approach in PDI is the following:
Sort the rows by the Field column
Add a sequence for changing values in the Field column
Map all sequence values > 1 to zero
These first 3 effectively flag the first time a value was seen (no matter the date).
Sort the rows by year/month
Sum the mapped sequence values by year+month
Get a Cumulative Sum of all the previous sums
These 3 aggregate the distinct values per month, then keep a cumulative sum. I posted a Gist of this transformation here.
A more efficient solution is to parallelize the two sorts, then join at the latest point possible. I posted this one as it is easier to explain, but it shouldn't be too difficult to take this transformation and make it more parallel.
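Since the source database is PostgreSQL, the same three-step logic could also be pushed into a single window-function query instead of a PDI transformation. A hedged sketch, assuming a hypothetical table events(date, field) matching the sample data:
SELECT month,
       SUM(COUNT(*)) OVER (ORDER BY month) AS running_distinct_count
FROM (
  SELECT field, date_trunc('month', MIN(date)) AS month
  FROM events
  GROUP BY field            -- the month each value was first seen
) AS firsts
GROUP BY month
ORDER BY month;
One caveat: months that introduce no new values (2013-04 in the example) are simply absent from this output and would need a join against a month calendar to show up with the carried-forward total.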

SQL Sum and Group By for a running Tally?

I'm completely rewriting my question to simplify it; sorry if you read the prior version. (The previous version included a very complex query example that distracted from what I really need.) I'm using SQL Express.
I have a table of lessons.
LessonID  StudentID  StudentName  LengthInMinutes
1         1          Chuck        120
2         2          George       60
3         2          George       30
4         1          Chuck        60
5         1          Chuck        10
These would be ordered by date. (Of course the actual table is thousands of records with dates and other lesson-related data but this is a simplification.)
I need to query this table such that I get all rows (or a subset of rows by a date range or by student), but I need my query to add a new column we might call PriorLessonMinutes. That is, the sum of all minutes of all lessons for the same student in lessons of PRIOR dates only.
So the query would return:
LessonID  StudentID  StudentName  LengthInMinutes  PriorLessonMinutes
1         1          Chuck        120              0
2         2          George       60               0
3         2          George       30               60   (the Length from row 2 only)
4         1          Chuck        60               120  (the Length from row 1 only)
5         1          Chuck        10               180  (the sum of Lengths from rows 1 and 4)
In essence, I need a running tally of the sum of prior lesson minutes for each student. Ideally the tally shouldn't include the current row, but if it does, no big deal as I can do subtraction in the code that receives the query.
Further, (and this is important) if I retrieve only a subset of records, (for example by a date range) PriorLessonMinutes must be a sum that considers rows that are NOT returned.
My first idea was to use SUM() and GROUP BY Student, but that isn't right because, unless I'm mistaken, it would include a sum of minutes for all of each student's rows, including rows that come after the current row, which aren't relevant to the sum I need.
OPTIONS I'M REJECTING: I could scan through all rows in the code that receives them (although this would force me to retrieve all rows unnecessarily), but that's obviously inefficient. I could also put a real data field in there and populate it, but this too presents problems when other records are deleted or altered.
I have no idea how to put such a query together. Any guidance?
This is a great opportunity to use windowed aggregates. The trick is that you need SQL Server 2012 Express or later. If you can get it, then this is the query you are looking for:
select *,
sum(LengthInMinutes)
over (partition by StudentId order by LessonId
rows between unbounded preceding and 1 preceding)
as PriorLessonMinutes
from Lessons
Note that it returns NULLs instead of 0s (zeroes) for each student's first lesson, where the window frame is empty. If you insist on zeroes, use the COALESCE function to turn the NULLs into zeroes.
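For instance, the same query with the window sum wrapped in COALESCE:
select *,
    coalesce(sum(LengthInMinutes)
             over (partition by StudentId order by LessonId
                   rows between unbounded preceding and 1 preceding), 0)
    as PriorLessonMinutes
from Lessons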
I suggest using a nested query to limit the number of rows returned:
select * from
(
select *,
sum(LengthInMinutes)
over (partition by StudentId order by LessonId
rows between unbounded preceding and 1 preceding)
as PriorLessonMinutes
from Lessons
) as NestedLessons
where LessonId > 3 -- this is an example of a filter
This way the filter is applied after the aggregation is complete.
Now, if you want to apply a filter that doesn't affect the aggregation (like only querying data for a certain student), you should apply it to the inner query instead: pruning rows that cannot affect the computation (like data for other students) as early as possible will improve performance.
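For example, a student-only filter belongs inside the nested query (the StudentId value is just an illustration):
select * from
(
    select *,
        sum(LengthInMinutes)
        over (partition by StudentId order by LessonId
              rows between unbounded preceding and 1 preceding)
        as PriorLessonMinutes
    from Lessons
    where StudentId = 1 -- pruned before the window is computed
) as NestedLessons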
I feel the following code will serve your purpose. Check it:
select Students.StudentID, Students.First, Students.Last,
       sum(Lessons.LengthInMinutes) as TotalPriorMinutes
from Lessons, Students
where Lessons.StartDateTime < getdate()
  and Lessons.StudentID = Students.StudentID
  and StartDateTime >= '20090130 00:00:00'
  and StartDateTime < '20790101 00:00:00'
group by Students.StudentID, Students.First, Students.Last