What does "group measurements by" mean in the node Outlier Removal (Knime)? - boxplot

I have selected "box plot" as the method and 1.5 as the factor.
The node description says:
"Subsets
Select the columns by which the measurements should be grouped (example: plates, batches, runs...)"
What does the "group measurements by" option do? Aren't the outliers determined using Mean + IQR*(1.5) and Mean - IQR*(1.5), independently of other columns?

It means you do not need a group loop over the table for each plate, batch, etc. (as in High Throughput Screening); the node still finds the outliers within each of those groups. If you do not need groups, you can simply leave the grouping empty (or, if a grouping column is required, group by a constant column).
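To illustrate what grouping changes, here is a minimal Python sketch of per-group box plot outlier flagging. It is not the KNIME implementation: it assumes the usual box plot fences (Q1 - factor*IQR and Q3 + factor*IQR) and a simple linear-interpolation quantile, and the function names and plate data are made up for the example.

```python
from collections import defaultdict


def quartiles(values):
    """Return (Q1, Q3) using linear interpolation between sorted values."""
    data = sorted(values)

    def quantile(p):
        pos = p * (len(data) - 1)
        lower = int(pos)
        upper = min(lower + 1, len(data) - 1)
        return data[lower] + (pos - lower) * (data[upper] - data[lower])

    return quantile(0.25), quantile(0.75)


def flag_outliers(rows, factor=1.5):
    """rows is a list of (group, value) pairs.

    Returns one boolean per row: True if the value lies outside the
    box plot fences computed from its own group only.
    """
    by_group = defaultdict(list)
    for group, value in rows:
        by_group[group].append(value)

    fences = {}
    for group, values in by_group.items():
        q1, q3 = quartiles(values)
        iqr = q3 - q1
        fences[group] = (q1 - factor * iqr, q3 + factor * iqr)

    return [not (fences[g][0] <= v <= fences[g][1]) for g, v in rows]


# Hypothetical measurements from two plates with different value ranges.
rows = [("plate1", v) for v in (10, 11, 12, 13, 50)]
rows += [("plate2", v) for v in (48, 49, 50, 51, 52)]

grouped = flag_outliers(rows)                             # fences per plate
ungrouped = flag_outliers([("all", v) for _, v in rows])  # one constant group
```

With grouping, the value 50 on plate1 is flagged because it is far from the rest of that plate. Without grouping (a single constant group), the same value blends in with the plate2 values and nothing is flagged.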

How to plot daily increment data from a sparse data set with interpolation in Grafana?

How can I plot time-grouped increment data in a bar graph in Grafana, but with a sparse data source that needs interpolation BEFORE calculating the increment?
My data source is an InfluxDB with a sparse time series of accumulated values (think: gas meter readings). The data points are usually a few days apart.
My goal is to create a bar graph with value increase per day. For the missing values, linear interpolation will do just fine.
I've come up with
SELECT spread("value") FROM "gas" WHERE $timeFilter GROUP BY time(1d) fill(linear)
but this won't work as the fill(linear) command is executed AFTER the spread(value) command. If I use time periods much greater than my granularity of input data (e.g. time(14d)), it shows proper bars, but once I use smaller periods, the bars collapse to 0.
How can I apply the interpolation BEFORE the difference operation?
The situation you describe arises because fill() only fills a GROUP BY time() interval when that interval contains no data at all. If you get spread = 0, the interval probably contains exactly one value, so fill() is not applied.
I suggest using a subquery with a shorter grouping interval to interpolate your original signal first. Here is an example:
SELECT spread("interpolated_value") FROM (
SELECT first("value") as "interpolated_value" from "gas"
WHERE $timeFilter
GROUP BY time(10s) fill(linear)
)
GROUP BY time(1d) fill(none)
The subquery produces one value for each 10s interval (I recommend making this interval as large as you can accept). If an interval contains values, it picks the first one; if it contains none, it interpolates.
The outer query then calculates the spread from this interpolated set of values.
All of the above only describes how to get interpolated data at shorter intervals. I strongly recommend thinking about how useful this data really is: a spread calculated from linearly interpolated data may be of questionable reliability.
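The "interpolate first, then take the difference" idea can be checked outside InfluxDB with a small Python sketch. It assumes times measured in whole days and hypothetical meter readings, and it takes the difference between consecutive day boundaries, which for a monotonically increasing counter plays the same role as spread() per day.

```python
def interpolate(readings, t):
    """Linearly interpolate a cumulative reading at time t (in days).

    readings is a list of (time, value) pairs sorted by time.
    """
    for (t0, v0), (t1, v1) in zip(readings, readings[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the range of the readings")


def daily_increments(readings):
    """Interpolate at every day boundary, then difference consecutive days."""
    first_day = readings[0][0]
    last_day = readings[-1][0]
    values = [interpolate(readings, day) for day in range(first_day, last_day + 1)]
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Hypothetical sparse gas meter readings: (day, accumulated value).
readings = [(0, 100.0), (3, 106.0), (5, 107.0)]
increments = daily_increments(readings)  # [2.0, 2.0, 2.0, 0.5, 0.5]
```

The increments are constant between two real readings, which shows the limitation mentioned above: the interpolation spreads consumption evenly over each gap, whatever actually happened on those days.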

Hiding column in Spotfire CrossTable

I have one data table with various identifiers in 3 columns (called BU, Company, and Group). I created a cross table that sums the face by 2 layers: an identifier ('Actual' and 'Plan') and a reporting period ('9/30/16' and '9/30/17'). The table was easy, aside from the variance section. I am currently using this formula to compute the variance:
SN(Sum([Face]) - Sum([Face]) OVER (ParallelPeriod([Axis.Columns])),
Sum([Face])) AS [PlanVariance]
This gives me the correct values in the Plan Variance section of the cross table for the Plan identifier. Unfortunately, it gives the wrong values for the Actual identifier: the Actual identifier under Plan Variance is equal to the Actual identifier under the Sum(Face) section. If I remove the SN function, Plan Variance is empty for all identifiers that have no face for a group AND is empty for the Actual section under Plan Variance.
Is there a way to create a cross table that would show the variance for the Plan Identifier ONLY? Can I stop the cross table from calculating the plan variance on the actual segment? Or is there a way to have the actual field hidden in the plan variance section of the final visualization?
Thanks for any help/advice you can provide!

Sum Filtered Data in Tableau

I have a database of users and each user record has "User ID" and "Group". After filtering out a chunk of the records, I'd like to sum the number of users within each group. Currently I am doing that with the calculation:
{FIXED[Group]:SUM([Number of Records])}
The problem here is that this calculation appears to ignore my filter and just gives a total count per group from all of the unfiltered data.
Is there a quick way to sum the number of visible users in each group after applying a filter?
The easiest way of solving this would be to take advantage of the order of operations in Tableau.
The issue you are having at the moment is that the FIXED LOD calculation is computed before dimension filters are applied.
If you want to calculate a field at a different level of detail than the view, then an LOD is still the way to go. All you need to do is force Tableau to apply the filters before computing the FIXED calculation.
To do this, change your filters to context filters. This is done by right-clicking on the filter and selecting "Add to Context". You will see the filter change from blue to grey.
Your calculated field should now be sensitive to any context filters.
Find out more here
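The effect of the order of operations can be mimicked in a few lines of Python. This is only an analogy, not Tableau code: the user records and the "active" field used as the filter are hypothetical.

```python
from collections import Counter

# Hypothetical user records; "active" stands in for whatever field is filtered.
users = [
    {"user_id": 1, "group": "A", "active": True},
    {"user_id": 2, "group": "A", "active": False},
    {"user_id": 3, "group": "A", "active": True},
    {"user_id": 4, "group": "B", "active": False},
    {"user_id": 5, "group": "B", "active": True},
]


def count_per_group(records):
    """Count records per group, like a FIXED [Group] count."""
    return Counter(record["group"] for record in records)


# Ordinary dimension filter: the FIXED count is computed first, on all rows,
# so the filter has no effect on it.
count_before_filter = count_per_group(users)

# Context filter: rows are filtered first, then the FIXED count is computed.
visible = [record for record in users if record["active"]]
count_after_context_filter = count_per_group(visible)
```

Here `count_before_filter` gives A = 3 and B = 2, while `count_after_context_filter` gives A = 2 and B = 1, which is the per-group count of visible users the question asks for.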

How to calculate the mean of a dataframe column and find the top 10%

I am very new to Scala and Spark, and am working on some self-made exercises using baseball statistics. I am using a case class to create an RDD and assign a schema to the data, and am then turning it into a DataFrame so I can use SparkSQL to select groups of players whose stats meet certain criteria.
Once I have the subset of players I am interested in looking at further, I would like to find the mean of a column, e.g. Batting Average or RBIs. From there I would like to break all the players into percentile groups based on their average performance compared to all players: the top 10%, bottom 10%, 40-50%, and so on.
I've been able to use the DataFrame.describe() function to return a summary of a desired column (mean, stddev, count, min, and max), but all the values come back as strings. Is there a better way to get just the mean and stddev as Doubles, and what is the best way of breaking the players into groups of 10-percentiles?
So far my thoughts are to find the values that bookend the percentile ranges and writing a function that groups players via comparators, but that feels like it is bordering on reinventing the wheel.
I was able to get the percentiles by using window functions and applying ntile() and cumeDist() over the window. ntile() creates groups based on an input number: if you want things grouped by 10%, use ntile(10); if by 5%, use ntile(20). For a more fine-tuned result, cumeDist() applied over the window outputs a new column with the cumulative distribution, and the rows can be filtered from there through select(), where(), or a SQL query.
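For readers unfamiliar with these two window functions, the following plain-Python sketch shows what they compute under the standard SQL definitions. It is not the Spark API; the function names simply mirror the SQL ones, and the batting averages are made-up values.

```python
def ntile(row_count, n):
    """Assign bucket numbers 1..n to row_count ordered rows.

    Buckets are as even as possible; the first (row_count % n) buckets
    receive one extra row, as in SQL's NTILE.
    """
    base, extra = divmod(row_count, n)
    tiles = []
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)
        tiles.extend([bucket] * size)
    return tiles


def cume_dist(values):
    """Fraction of rows with a value less than or equal to each row's value."""
    total = len(values)
    return [sum(1 for other in values if other <= v) / total for v in values]


# Hypothetical batting averages, already sorted in ascending order.
averages = [0.210, 0.250, 0.250, 0.275, 0.300]

halves = ntile(len(averages), 2)   # [1, 1, 1, 2, 2]
distribution = cume_dist(averages)  # [0.2, 0.6, 0.6, 0.8, 1.0]

# Players in the top 40% are those whose cumulative distribution exceeds 0.6.
top = [avg for avg, cd in zip(averages, distribution) if cd > 0.6]
```

Note that ntile() splits by row position, so tied values can land in different buckets, whereas cume_dist() gives tied values the same result.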

Adding Reference Line for Weighted Average in Tableau

I've got a bar chart with three months worth of data. Each column in the chart is one month's data showing the percentage of Rows that met a certain criterion for that month. In the first month, 100% of 2 rows meet the measure. In the second month, 24.2% of 641 rows meet the measure. In the 3rd month, 28.3% of 1004 rows meet the measure. My reference line which is supposed to show the average across the entire time-frame is showing 50.8%, the simple average (i.e. [100+24.2+28.3]/3) instead of the weighted average (i.e. [100*2+641*24.2+1004*28.3]/[2+641+1004]).
On the Rows shelf, I have a measure called "% that meet the criterion", which is defined as SUM([Criterion])/SUM([NUMBER OF RECORDS]).
The criterion measure is 1 for any record that qualifies and null for any that do not qualify.
If I go to Analysis >> Totals >> Show Row Grand Totals, a 4th bar is added, and that bar shows the correct weighted average of the other three bars (26.8%), but I really want this to be shown as a reference line instead of having an extra bar on the chart. (Adding the Grand Total bar also drops the reference line down to 44.8%, which is the simple average of the 4 bars now shown on the chart--I can't think of a less useful piece of information than that).
How can I add the weighted average as a reference line?
Instead of using 'Average' as your aggregation, try using 'Total' instead in the Edit Reference Line dialogue window.
I have to say it's a bit counter-intuitive, but this is what the Tableau online help has to say about it:
http://onlinehelp.tableau.com/current/pro/online/mac/en-us/reflines_addlines.html
Total - places a line at the aggregate of all the values in either the cell, pane, or the entire view. This option is particularly useful when computing a weighted average rather than an average of averages. It is also useful when working with a calculation with a custom aggregation. The total is computed using the underlying data and behaves the same as selecting one of the totals option the Analysis menu.
If you are using Tableau 9, you can make a second calculated field using an LOD expression:
{ SUM([Criterion]) / SUM([NUMBER OF RECORDS]) }
This will calculate the ratio for the entire data set after applying context and data source filters, without partitioning the data by any of the other dimensions in your view (such as month in your case).
If you place that new field on the Detail shelf, you can then use it to create a reference line.
There are other ways to generate a weighted average, but this is probably the simplest in your case.
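The difference between the two averages in the question can be verified with a short Python calculation using the row counts and percentages given there; the variable names are only for illustration.

```python
# (number of rows, percentage meeting the criterion) for each month,
# taken from the question.
months = [(2, 100.0), (641, 24.2), (1004, 28.3)]

# Average of averages: every month counts equally, regardless of its size.
simple_average = sum(pct for _, pct in months) / len(months)

# Weighted average: each month's percentage is weighted by its row count,
# which is equivalent to SUM([Criterion]) / SUM([NUMBER OF RECORDS]) overall.
total_rows = sum(rows for rows, _ in months)
weighted_average = sum(rows * pct for rows, pct in months) / total_rows

print(round(simple_average, 1))    # 50.8
print(round(weighted_average, 1))  # 26.8
```

The month with only 2 rows dominates the simple average but barely moves the weighted one, which is why the 'Total' aggregation and the LOD expression both land on 26.8%.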