y axis limit appears to alter the data that is analysed

I have a bunch of data where the hours taken to process an item range from 3 to 3000 hours; most of the data is <1000 hours.
I am creating a boxplot of that data. I have a large number of outliers within the data that I don't need to display, but I do need to analyse.
I have tried both 'scale_y_continuous(limits=c(0,1000))' and 'ylim(0,1000)', and both appear to change the data that is used to create the boxplot. I altered the limits to '20' to test this theory and still got a complete plot, which can only be because the method I'm using to limit the axis also limits the range of data analysed.
I'd like to limit the y axis but not limit the range of data that is used in the analysis. What function do I use to accomplish that?
many thanks

It appears that 'coord_cartesian(ylim = c(nnn, nnn))' is what I needed to use.
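A minimal sketch of the difference, assuming a hypothetical data frame df with an hours column roughly like the one described above:

library(ggplot2)

# Hypothetical data: processing times from a few hours up to 3000 hours
df <- data.frame(hours = c(rexp(500, rate = 1/300), 2500, 3000))

# ylim() / scale_y_continuous(limits = ...) drop observations outside the
# limits BEFORE the boxplot statistics are computed, so the box and whiskers
# are recalculated from the truncated data:
ggplot(df, aes(x = "", y = hours)) +
  geom_boxplot() +
  ylim(0, 1000)

# coord_cartesian() only zooms the view; every observation still feeds the
# boxplot statistics, and points above 1000 are simply drawn off-screen:
ggplot(df, aes(x = "", y = hours)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 1000))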

Related

Athena/DDB to condense millions of data points for plotting them on a graph

I need to plot trend charts in the React app based on user inputs such as timestamps, devices, etc. I have related time series data in DynamoDB and S3 (which I can query using Athena).
Returning all those millions of data points for a graph seems unreasonable and is super laggy.
I guess one option is "binning", where I decide the number of bins based on how big the time range is and take the average of the readings in each bin. However, I'm concerned about how well it will show the drops and highs, which we need to show accurately.
Athena queries and DDB queries (due to the 1 MB limit) both seem fairly slow so far.
Of course, the size of the response payload is another concern, as the API and Lambda limit it to 10 MB and 6 MB respectively.
Any ideas?
I can't suggest anything smarter than "binning", but if you are concerned that the bucket interval might become too wide and performance might suffer, you can fix the interval and then create more than one table. For example, the interval can be 1 hour and you can have a new table for each week.
This is what we did when we had to deal with time series in DynamoDB. At some point, we decided to switch to Amazon Timestream.
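For the Athena side, a minimal sketch of what an hourly binning query could look like, assuming a hypothetical table named readings with columns ts (timestamp) and value (double):

SELECT
  date_trunc('hour', ts) AS bucket,
  avg(value)             AS avg_value,
  min(value)             AS min_value,  -- keep the extremes so drops and spikes stay visible
  max(value)             AS max_value
FROM readings
WHERE ts BETWEEN timestamp '2023-01-01 00:00:00' AND timestamp '2023-01-08 00:00:00'
GROUP BY 1
ORDER BY 1;

Returning min and max alongside the average keeps the payload per bucket small while still letting the chart show the drops and highs the question is worried about.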

How to plot daily increment data from a sparse data set with interpolation in Grafana?

How can I plot time-grouped increment data in a bar graph in Grafana, but with a sparse data source that needs interpolation BEFORE calculating the increment?
My data source is an InfluxDB with a sparse time series of accumulated values (think: gas meter readings). The data points are usually a few days apart.
My goal is to create a bar graph with value increase per day. For the missing values, linear interpolation will do just fine.
I've come up with
SELECT spread("value") FROM "gas" WHERE $timeFilter GROUP BY time(1d) fill(linear)
but this won't work as the fill(linear) command is executed AFTER the spread(value) command. If I use time periods much greater than the granularity of my input data (e.g. time(14d)), it shows proper bars, but once I use smaller periods, the bars collapse to 0.
How can I apply the interpolation BEFORE the difference operation?
The situation you describe is caused by the fact that fill() only fills data when there is nothing at all in a GROUP BY time() period of your query. If you get spread = 0, you probably have only one value in that period, so no fill() is applied.
What I can suggest is to use a subquery with a smaller GROUP BY time() period to prepare an interpolation of your original signal. This is an example:
SELECT spread("interpolated_value") FROM (
    SELECT first("value") AS "interpolated_value" FROM "gas"
    WHERE $timeFilter
    GROUP BY time(10s) fill(linear)
)
GROUP BY time(1d) fill(none)
The subquery prepares a value for each 10s period (I recommend setting this interval as high as you can accept). If a 10s period contains values, it picks the first one; if there is no value in the period, it interpolates.
The main query then uses this prepared, interpolated set of values to calculate the spread.
All of the above only describes how you can get interpolated data within shorter periods. I strongly recommend thinking about the usability of this data: calculating spread from linearly interpolated data may have questionable reliability.

Removing outliers with multiple consecutive values similar to a step

I am processing ocean wave data, where I have a time series of the Peak Wave Period (Tp (s)). The typical values for Tp range from 2 s to 15 s for this location. However, it may reach higher values above 15 s during extreme events such as a storm. Hence, removing data based on a threshold value is not suitable.
As you can see in the figure below, there are multiple values that are outliers. The high values occurred for a short duration and then dropped down. An extreme event would last for hours.
I have tried the functions filloutlier and medfilt1, but they are not successful in removing the outliers, which I presume is because multiple consecutive outlier data points exist.
Is there a built-in MATLAB function that can handle such a situation?
Otherwise, if I need to write my own function to filter such signals, could you provide some guidance?
Attaching a small data sample here as well: Download Data
Dataset plot (Only the segment in the provided data above)
Zoomed in plot at one of the outliers.
If we know that the values need to be in the range (2, 15), we can clip the values > 15 to 15.
Another way is to take a high percentile (say the 95th) of the observations and clip values above it.
The filloutlier and medfilt1 methods are not removing values like 18 because they do not treat them as outliers; 18 is not very far from the typical range of (2, 15).
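A minimal MATLAB sketch of both options, assuming Tp is a numeric vector of peak wave periods in seconds:

% Option 1: hard clip everything above the physical upper bound of 15 s
TpClipped = min(Tp, 15);

% Option 2: clip at a high percentile of the observed data instead of a fixed bound
% (prctile requires the Statistics and Machine Learning Toolbox)
upperBound = prctile(Tp, 95);
TpClipped = min(Tp, upperBound);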

What's the most efficient way to use large data from Excel in my C# code?

I ran a computer simulation for my pendulum to measure the time taken to reach the lowest point, for every velocity and every angle.
As you can imagine there is a lot of data: thousands of lines covering all angles and velocities.
On every frame, I will measure the velocity and angle of the pendulum and look for the closest data in my Excel spreadsheet.
How can I go about this to make sure it's not too CPU-intensive?
Should I create a massive array where every element corresponds to a certain angle? For example, myArray[30] would hold all velocities and times for my data between 30.0 and 30.999 degrees. (That way I would avoid lots of if statements.)
Or should I keep everything in my Excel spreadsheet?
Any suggestion?
The best approach in my opinion would be to divide your data into intervals based on its distribution, since you have to access that data on every frame. Then, when you measure the velocity and angle, you can look up the interval and access only that part of your data.
I would find the maximum and minimum of your data points while importing into Unity and then split that range into intervals of size (maximum - minimum) / NumOfIntervals. Let's say your interval size is 5 for each angle. When you get an angle of 17, you can do (int)(17 / 5) = 3 (assuming indexes start from zero) and go to the item at index 3 in your structure. This can be a dictionary or an array of instances of an arbitrary class, depending on your data.
I can try to help further if you can share the structure of your data. But in my opinion an even distribution of data across the intervals is important.
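A minimal C# sketch of this bucketing idea, assuming the spreadsheet has been exported to a headerless CSV with columns angle, velocity, time (all names here are hypothetical):

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public class PendulumLookup
{
    const double BucketSize = 5.0;  // degrees per bucket, as in the example above
    readonly List<(double velocity, double time)>[] buckets;

    public PendulumLookup(string csvPath, double maxAngle = 180.0)
    {
        int bucketCount = (int)(maxAngle / BucketSize) + 1;
        buckets = new List<(double velocity, double time)>[bucketCount];
        for (int i = 0; i < buckets.Length; i++)
            buckets[i] = new List<(double velocity, double time)>();

        // Load the CSV once at startup instead of touching Excel every frame.
        foreach (var line in File.ReadLines(csvPath))
        {
            var parts = line.Split(',');
            double angle    = double.Parse(parts[0], CultureInfo.InvariantCulture);
            double velocity = double.Parse(parts[1], CultureInfo.InvariantCulture);
            double time     = double.Parse(parts[2], CultureInfo.InvariantCulture);
            buckets[BucketIndex(angle)].Add((velocity, time));
        }
    }

    int BucketIndex(double angle) =>
        Math.Min((int)(angle / BucketSize), buckets.Length - 1);

    // Look only inside the bucket for the measured angle and return the time
    // of the entry whose velocity is closest to the measured one.
    public double LookupTime(double angle, double velocity)
    {
        double bestTime = 0, bestDiff = double.MaxValue;
        foreach (var (v, t) in buckets[BucketIndex(angle)])
        {
            double diff = Math.Abs(v - velocity);
            if (diff < bestDiff) { bestDiff = diff; bestTime = t; }
        }
        return bestTime;
    }
}

Building the buckets once up front keeps the per-frame work down to scanning a single small bucket instead of the whole dataset.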

splunk - create chart with values function

Background: soon after my performance test, the response time details of the business transactions will be fed into Splunk, from where I need to generate a trend graph for a couple of transactions over a certain period of time.
Now I am able to group the response time data for these transactions, but in the visualization I am not able to generate the chart.
Refer to SS01.jpg for more details.
query used: index=xyz source=abc (Period!=Period) Transaction_Name=Search OR Address_Book OR Policy | chart values(Average) as Average by Transaction_Name
I want the chart to appear in either format A or B, as shown in SS02.jpg.
Please help me on this.
Thanks.
You have very low values (e.g. 0.046, 0.099) in the query output and a very high maximum value on the chart Y axis (100). In order to get the desired picture, just reduce the maximum value for the vertical axis.
To make the change permanent, you can amend the charting.secondaryAxis.maximumNumber parameter value as described in Controlling chart (y axis) range.
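In the dashboard's Simple XML that could look something like the sketch below; the panel structure is illustrative, the option name is the one mentioned above, and 1 is just an example ceiling that fits values like 0.046 and 0.099:

<chart>
  <search>
    <query>index=xyz source=abc (Period!=Period) Transaction_Name=Search OR Address_Book OR Policy | chart values(Average) as Average by Transaction_Name</query>
  </search>
  <option name="charting.secondaryAxis.maximumNumber">1</option>
</chart>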
Personally I prefer BM.Sense output for performance test results analysis.