Aggregate function over a given time interval (Spark / Scala)

I need to aggregate a dataset into 5-minute intervals, taking the average within each interval. The first column is a timestamp column and I am using Scala; below you may find the input and the expected output. Your help will be highly appreciated.

Generally you can extract the 5-minute bucket from each timestamp (e.g. by getting the timestamp as a number, dividing by 5 minutes, and flooring the result).
Then you simply do:
df.groupBy("bucket").avg("value")
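For concreteness, here is a minimal sketch assuming a DataFrame df with a timestamp column ts and a numeric column value (both names are illustrative); you can either let Spark's built-in window function do the bucketing, or floor the epoch seconds yourself:

import org.apache.spark.sql.functions._

// Option 1: Spark's built-in 5-minute time window
val byWindow = df
  .groupBy(window(col("ts"), "5 minutes"))
  .agg(avg(col("value")).as("avg_value"))

// Option 2: bucket manually by flooring epoch seconds to a multiple of 300
val byBucket = df
  .withColumn("bucket", (unix_timestamp(col("ts")) / 300).cast("long") * 300)
  .groupBy(col("bucket"))
  .agg(avg(col("value")).as("avg_value"))

Option 1 is usually preferable since it also labels each row with the window's start and end times.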

Related

How to do a distinct count of a metric using the Graphite datasource in Grafana?

I have a metric that shows the state of a server. The values are integers: if the value is 0 (zero) the server is stable, otherwise it is unstable. The graph we have is at a minute level. I want to show an aggregated value indicating how many hours the server was unstable in the selected time range.
Let's say I select "Last 7 days" as the time range; I want to get X hours of server instability.
One more thing: I have a line graph (time series) that shows the state of the server. When I select "Last 24 hours" or "Last 48 hours" I get the graph at a minute level, but when I increase the duration to a quarter I get a point every 5 minutes or so. I understand it's aggregating the values, but does anybody know how Grafana does this aggregation?
I have tried the scaleToSeconds and consolidateBy functions, among many others, to first get the count of non-zero-value minutes, but with no success.
Any help would be greatly appreciated.
Thanks in advance.
There are a few different ways to tackle this; there are two places where aggregation happens in this situation:
When you query for a time range longer than your raw retention interval and whisper returns aggregated data. The aggregation method used here is defined in your carbon aggregation configuration.
When Grafana sends a query to Graphite it passes maxDataPoints=<width of graph in pixels>, and Graphite will perform aggregation to return at most that many points (because you don't have enough pixels to render more points than that). The method used for this consolidation is controlled by the consolidateBy function.
It is possible for both of these to apply to the same query. If, for example, you have a panel that queries 3 days' worth of data, and you store 2 days at 1-minute and 7 days at 5-minute resolution in whisper, you'd get 72 * 60 / 5 = 864 points from the 5-minute archive; but if your graph is only 500px wide, at runtime that would be consolidated down to 10-minute intervals, returning 432 points.
So, if you want to always have access to the count then you can change your carbon configuration to use sum aggregation for those series (and remove the existing whisper files so new ones are created with the new aggregation config), and pass consolidateBy('sum') in your queries, and you'll always get the sum back for each interval.
That said, you can also address this at query time by multiplying the average back out to get a total (assuming your whisper aggregation config uses average). The simplest way is to summarize the data with average into buckets that match the longest aggregation interval you'll be querying, then scale those values by the number of minutes in each bucket (10 here) to recover the total number of unstable minutes. Finally, use consolidateBy('sum') so that any runtime consolidation behaves properly:
consolidateBy(scale(summarize(my.series, '10min', 'avg'), 10), 'sum')
With all of that said, you may want to consider reporting uptime in terms of percentages rather than raw minutes, in which case you can use the raw averages directly.
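For example, a sketch along those lines (the metric name is a placeholder) that charts the daily unstable percentage for a 0/1 series:
alias(scale(summarize(my.series, '1d', 'avg'), 100), 'unstable %')
Here the daily average of the 0/1 samples is the fraction of unstable minutes, and scaling by 100 turns it into a percentage.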
When you say the value is zero (0), the server is healthy - what other values are reported while the server is unhealthy/unstable? If you're only reporting zero (healthy) or one (unhealthy), for example, then you could use the sumSeries function to get a count across multiple servers.
Some more information is needed here about the types of values the server is reporting in order to give you a better answer.
Grafana does aggregate - or consolidate - data typically by using the average aggregation function. You can override this using the 'sum' aggregation in the consolidateBy function.
To get a running calculation over time, you would most likely have to use the summarize function (also with the sum aggregation) and define the time period, e.g. 1 hour, 1 day, 1 week, and so on. You could take this a step further by combining this with a time template variable so that as the period grows/shrinks, the summarize period will increase/decrease accordingly.
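Putting those together, a rough sketch (the metric name is a placeholder) that yields unstable minutes per day for a 0/1 series:
consolidateBy(summarize(my.server.state, '1d', 'sum'), 'sum')
summarize(..., '1d', 'sum') adds up the per-minute 0/1 samples within each day, and consolidateBy('sum') keeps any runtime consolidation from averaging those daily counts away.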

Flatline: How to calculate days between two dates

I want to add a calculated field to a BigML dataset containing the number of days between two dates.
I'm trying to figure out how to calculate the number of days between two date fields using the Flatline language, but I can't work it out even after reading the docs.
Any clue about how to create this calculated field?
PS: Could somebody with enough reputation create and add the tags "bigml" and "flatline"?
Currently, the only way to subtract dates is by first transforming them to an
epoch (number of milliseconds since 1970) and then computing the difference:
(- (epoch "12/03/1990") (epoch "01/01/1988"))
That will give you the number of milliseconds between the two dates, which can then
be converted to other units. What it won't give you, of course, is the difference
in calendar days: Flatline doesn't yet have a way of subtracting calendar dates.
But it shouldn't be too difficult to add if it's a feature you need :)
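For example, a sketch (assuming Flatline's standard arithmetic primitives) that converts the millisecond difference to an approximate day count by dividing by the number of milliseconds in a day:
(/ (- (epoch "12/03/1990") (epoch "01/01/1988")) (* 1000 60 60 24))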

Druid: Get the number of days in a given date range to compute a daily average

I need the total number of days in a given interval. I have already tried using a distinct count of dates, but that fails because for some days no data is present. Once I have the number of days in the date interval I will be able to compute the daily average.
Any other approaches are also welcome.

Calculating the difference between two dates using age and extract gives differing results in PostgreSQL

I'm using PostgreSQL (on Amazon Redshift), and I need to calculate the difference between two dates and then use that value in a formula to compute a ratio, so the date difference needs to be translated to a numeric value, preferably a float or double precision.
I have two dates: 1/1/2017 and 1/1/2014. I need to find the difference between these two dates in number of days.
When I use the age function I get 1080 days:
select age('2017-01-01','2014-01-01')
However, since age returns an interval and I need to work with a numeric result, I am using EXTRACT to convert the final value. I chose epoch since I wasn't able to find any other value for EXTRACT that would yield the number of time units between the two dates. This formula yields 1095.75 days (the divisor is the number of seconds in a day):
select extract(epoch from age('2017-01-01','2014-01-01'))/86400
Why am I getting a difference of 19.75 days when using age vs using extract?
Did you try
select '2017-01-01'::date - '2014-01-01'::date;
Subtracting two dates yields the number of days as an integer; for these dates it returns 1096 (2016 was a leap year).
1080 is the figure you would get if every month were 30 days long (36 months × 30 days = 1080), as it would be if you used justify_days (either explicitly or because the DBMS called it implicitly). You don't say how you arrived at 1080, since age would normally print something like "3 years", but that seems the most likely explanation.
1095.75 seems the more correct figure, being 365.25 days multiplied by three years.
Out of those two, I would go with the latter method.
Although, as pointed out in the PostgreSQL documentation (http://www.postgresql.org/docs/8.1/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT), subtracting two values of type date yields the number of days directly:
select dtend - dtstart from somewhere
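Since the goal here is a numeric value for a ratio, a minimal sketch (reusing the placeholder names above, with an illustrative 365.25-day year) that casts the day count to a float:
select (dtend - dtstart)::float / 365.25 as years_elapsed from somewhere;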
The Redshift release notes say they recently added a MONTHS_BETWEEN function, which looks similar to Oracle's MONTHS_BETWEEN function, if that's what you're looking for: http://docs.aws.amazon.com/redshift/latest/dg/r_MONTHS_BETWEEN_function.html

MATLAB query to see how heavily the app is used

I have a MySQL table with over 6 million records, each with an epoch timestamp. I need to plot the distribution of all the timestamps across the time of day. In other words, I need to see how many timestamps fall between 7am and 8am, 8am and 9am, and so on, for all 24 hour-long blocks in a day. I do not need them plotted by day of the week or month, just by time of day. Each timestamp is in UTC.
Can someone help me?
You could use MySQL's FROM_UNIXTIME function to get date strings from the database and dump the results into a file, which you can subsequently read into MATLAB. Next, one way to extract the time of day of each record is to use MATLAB's datevec function, which returns each component of the date string separately:
datevec('2007-11-30 10:30:19')
ans =
2007 11 30 10 30 19
For instance, if you read in the data as one long vector of date strings, you could apply datevec to this vector and then grab the hour column of the resulting matrix. Then you can make a histogram of the counts using the hist or histc functions, depending on whether you want to specify bin centers or bin edges. If you have an hour column H, something like hist(H, 0:23) should work. The histc function might be a more natural fit for this kind of data but is slightly more involved; check the documentation.
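A minimal end-to-end sketch, assuming the query results were dumped to a text file with one date string per line (the file name and format are illustrative):

% Read one date string per line from the exported file
fid = fopen('timestamps.txt');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);

% Convert to date vectors and keep the hour component (column 4)
V = datevec(C{1});
H = V(:, 4);

% Histogram of record counts per hour of day (bin centers 0..23)
hist(H, 0:23);
xlabel('Hour of day (UTC)');
ylabel('Number of records');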