I want to calculate the mean, min, max, stddev, and 25%, 50%, and 75% values for the data in a Spark DataFrame.
I have tried the summary() function, but it does not give the exact 25%, 50%, and 75% values; they change on every run even when the data is the same.
How can I calculate the exact 25%, 50%, and 75% percentiles along with the other statistics?
Dataset.summary uses ApproximatePercentile to compute the quartiles, which is why the results vary between runs. If you need exact quartiles, use percentile instead, as below:
> SELECT percentile(col, 0.3) FROM VALUES (0), (10) AS tab(col);
3.0
> SELECT percentile(col, array(0.25, 0.75)) FROM VALUES (0), (10) AS tab(col);
[2.5,7.5]
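For example, assuming your DataFrame is registered as a temporary view named my_table with a numeric column col (both names here are placeholders), a single Spark SQL query can return all of the requested statistics exactly:

SELECT
  AVG(col)                                AS mean,
  MIN(col)                                AS min,
  MAX(col)                                AS max,
  STDDEV(col)                             AS stddev,
  PERCENTILE(col, ARRAY(0.25, 0.5, 0.75)) AS quartiles  -- exact, not approximate
FROM my_table;

Keep in mind that percentile requires a full pass over the sorted column, so it is slower than the approximate version summary uses.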
I have a Postgres table with about 500k rows. One of the columns, called score, has values ranging from 0 to 1. The data is not normally distributed.
Say I have an observation of 0.25. I'd like to find out where this would fall in the distribution of the score column. This is sometimes referred to as the percent rank.
E.g. a value of 0.25 being in the 40th percentile would mean that 0.25 is larger than 40% of the observations in the table.
I know I can calculate a frequency distribution with something like the query below, but this feels like overkill when all I want is a single percentile value.
select k, percentile_disc(k) within group (order by mytable.score)
from mytable, generate_series(0.01, 1, 0.01) as k
group by k
Sounds like you want the hypothetical-set aggregate function percent_rank():
SELECT percent_rank(0.25) WITHIN GROUP (ORDER BY score)
FROM mytable;
The manual:
Computes the relative rank of the hypothetical row, that is (rank - 1) / (total rows - 1). The value thus ranges from 0 to 1 inclusive.
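For instance, to phrase the result the way your example does (a sketch; score and mytable as in your question):

SELECT round((100 * percent_rank(0.25) WITHIN GROUP (ORDER BY score))::numeric, 1) AS pct
FROM mytable;

A result of 40.0 would mean that 0.25 is larger than 40% of the observations. Note the cast: percent_rank() returns double precision, while the two-argument round() takes numeric.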
I have a data table with three dimensions and one measure. For each row, I am trying to calculate the percentage of the grand total (the sum of all rows) using a calculated field.
As seen in the attached screenshot, for the column titled 'Dec-19' I want the values to be a percentage of the current value / grand total (calculated at the bottom as 122,187).
Screenshot of DataTable:
So e.g. for the Column B value of 2000, the Dec-19 column should show (97 / 122,187) * 100 ≈ 0.079.
I have achieved this by creating a calculated field with the formula SUM([sales]) / MAX({EXCLUDE [Column B]: SUM([sales])}), where sales is the measure used in the data table.
However, upon applying a filter on Column B, the percentage value changes. E.g. if I select the value 2000 in my filter for Column B, I get the percentage as 100%. It seems as if the percentage is being calculated based only on the rows left by the filter.
Since you haven't included any sample data, I created some of my own; I hope it resembles yours.
Thereafter, I built a cross-tab view in Tableau, somewhat like yours.
If that is the scenario, use a calculated field, say CF, like this instead of yours:
([Sales])/
{FIXED [Col1], DATETRUNC('month', [Col3]) : sum([Sales])}
Dragging it into the view, filtering won't affect your calculation, because FIXED level-of-detail expressions are computed before dimension filters in Tableau's order of operations.
I have 5 temperature sensors. I want to calculate the mean temperature of 4 of them, excluding the most extreme value (high or low).
Firstly: will std:unique(id) create a window of the last temperature readings for each id 1-5?
select
avg(tempEvent.temp) as meantemp
from
Event(id in (1, 2, 3, 4, 5)).std:unique(id) as tempEvent
Secondly: how could I change the select statement (possibly using an expression if necessary) to only calculate the mean of four values excluding the most extreme?
The background is, I want to know the deviation of each temperature from the average, but I don't want the average to include an anomalous id. Otherwise all temperatures will look like they are deviating from the average when really only one is.
Finding the average of the middle values, dropping the lowest and the highest reading, is simple enough, though not as elegant as your stream-based approach. The code below will work for any number of temps.
SELECT
  AVG(temp) AS meantemp
FROM (
  SELECT
    temp,
    COUNT(*) OVER ()            AS c,  -- total number of readings
    RANK() OVER (ORDER BY temp) AS r   -- 1 = coldest, c = hottest
  FROM [table]
) ranked
WHERE r > 1   -- drop the lowest reading
  AND r < c;  -- drop the highest reading
As for your second question, I'm not sure I understand: do you want the value from among the remaining middle values that has the greatest absolute deviation from their mean?
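If the goal is instead to drop only the single reading farthest from the overall mean, which is closer to what the question describes, a sketch like this could work (standard SQL window functions; [table] as above):

SELECT AVG(temp) AS meantemp
FROM (
  SELECT
    temp,
    RANK() OVER (ORDER BY dev DESC) AS r  -- 1 = farthest from the mean
  FROM (
    SELECT
      temp,
      ABS(temp - AVG(temp) OVER ()) AS dev  -- distance from the overall mean
    FROM [table]
  ) deviations
) ranked
WHERE r > 1;  -- exclude only the most extreme reading

The extra subquery level is needed because most dialects don't allow one window function inside another's ORDER BY clause.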
I am trying to divide the average value of column1 by the average value of column2, which will give me an average price from my data. I believe there is a problem with the syntax/structure of my code, or I am making a rookie mistake.
I have searched Stack Overflow and cannot find many examples of dividing two averaged columns, and I have checked the Postgres documentation.
The individual average query works fine (as shown here):
SELECT (AVG(CAST("Column1" AS numeric(4,2))),2) FROM table1
But when I combine two of them in an attempt to divide, it simply does not work.
SELECT (AVG(CAST("Column1" AS numeric(4,2))),2) / (AVG(CAST("Column2" AS numeric(4,2))),2) FROM table1
I am receiving the following error: "ERROR: row comparison operator must yield type boolean, not type numeric". I have tried a few other variations, which have mostly given me syntax errors.
The stray (..., 2) wrappers are the problem: in Postgres, (expr, 2) is a row constructor, so your query asks it to divide one two-element row by another, which is what triggers the row comparison error (the trailing 2 looks like a leftover second argument from a round() call). However, if you want to take the ratio of two averages, you could also just take the ratio of the sums:
SELECT SUM(CAST("Column1" AS numeric(4,2))) / SUM(CAST("Column2" AS numeric(4,2)))
FROM table1;
Note that SUM() takes a single input, not two. The reason we can use the sums is that the average would normalize both the numerator and the denominator by the same amount, the number of rows in table1, so that factor simply cancels out (assuming neither column has NULLs, since AVG ignores them).
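For completeness, the original approach also works once the stray wrappers are removed; if the trailing 2 was meant to round to two decimal places, apply round() to the final ratio instead (a sketch using the question's table and column names):

SELECT round(AVG(CAST("Column1" AS numeric(4,2))) / AVG(CAST("Column2" AS numeric(4,2))), 2) AS avg_price
FROM table1;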
Imagine I have a column my_variable of floats in a table my_table. I know how to convert each of the rows in this my_variable column into percentiles, but my question is: I have a number x that is not necessarily in the table, say 7.67. How do I efficiently compute where 7.67 falls in the distribution of my_variable? I would like to be able to say "7.67 is in the 16.7th percentile" or "7.67 is larger than 16.7% of the rows in my_variable." Note that 7.67 is not taken from the column; I'm supplying it in the SQL query itself.
I was thinking about ordering my_variable in ascending order, counting the number of rows that fall below the number I specify, and dividing by the total number of rows, but is there a more computationally efficient way of doing this?
If your data does not change too often, you can use a materialized view or a separate table, call it percentiles, in which you store 100 or 1,000 rows (depending on the precision you need). This table should have a descending index on the value column.
Each row contains the minimum value needed to reach a certain percentile, along with the percentile itself.
Then you just need to fetch the row with the largest value not exceeding the given number and read its percentile.
In your example the table would contain 1,000 rows, and you could have something like:
percentile | value
-----------+------
      16.9 |  7.71
      16.8 |  7.69
      16.7 |  7.66
      16.6 |  7.65
      16.5 |  7.62
And your query could be something like:
SELECT percentile
FROM percentiles
WHERE value <= 7.67
ORDER BY value DESC
LIMIT 1;
This is a valid solution if the number of SELECTs you run is much bigger than the number of updates to the my_table table.
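A sketch of how such a table could be populated, reusing the percentile_disc pattern from the earlier question (the 1,000-bucket granularity is just an assumption):

CREATE MATERIALIZED VIEW percentiles AS
SELECT k * 100 AS percentile,  -- e.g. 16.7
       percentile_disc(k) WITHIN GROUP (ORDER BY my_variable) AS value
FROM my_table, generate_series(0.001, 1, 0.001) AS k
GROUP BY k;

CREATE INDEX ON percentiles (value DESC);

Run REFRESH MATERIALIZED VIEW percentiles whenever my_table has changed enough to matter.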
I ended up doing:
select avg(dummy_var::float)
from (
  select case when var_name < 3.14 then 1 else 0 end as dummy_var
  from table_name
  where var_name is not null
) as t;  -- Postgres requires an alias on the subquery
Where var_name was the variable of interest, table_name was the table of interest, and 3.14 was the number of interest.
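Equivalently, the hypothetical-set aggregate from the percent_rank() answer above gives essentially the same fraction directly (same placeholder names):

SELECT percent_rank(3.14) WITHIN GROUP (ORDER BY var_name)
FROM table_name
WHERE var_name IS NOT NULL;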