Calculating median of column "Balance" from table "Marketing" - scala

I have a Spark (Scala) DataFrame "Marketing" with approximately 17 columns, one of which is "Balance". The data type of this column is Int. I need to find the median Balance. I can get as far as arranging it in ascending order, but how do I proceed after that? I was given a hint that the percentile function can be used, but I don't have any idea about this percentile function. Can anyone help?

Median is the same thing as the 50th percentile. If you do not mind using Hive functions, you can do the following:
marketingDF.selectExpr("percentile(CAST(Balance AS BIGINT), 0.5) AS median")
If you do not need an exact figure you can look into using percentile_approx() instead.
Documentation for both functions is located here.
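As a sanity check on the equivalence the answer relies on (median = 50th percentile), here is a small pure-Python sketch with made-up balance values, outside Spark:

```python
import statistics

# Hypothetical sample of "Balance" values (made-up data for illustration).
balances = [120, 450, 300, 980, 75, 210, 640]

# Median via the standard library.
med = statistics.median(balances)

# 50th percentile computed by hand: sort, then take the middle element
# (average the two middle elements for an even-length list).
s = sorted(balances)
n = len(s)
p50 = s[n // 2] if n % 2 == 1 else (s[n // 2 - 1] + s[n // 2]) / 2

print(med, p50)  # both are 300
```

The exact percentile() Hive function does the same middle-element lookup over the full sorted column, which is why it needs the CAST to an integral type in the expression above.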

Related

How can I optimize calculation of mean in pyspark while ignoring Null values

I have a Pyspark Dataframe with around 4 billion rows, so efficiency in operations is very important. What I want to do seems very simple. I want to calculate the average value from two columns, and if one of them is Null I want to only return the non-null value. In Python I could easily accomplish this using np.nanmean, but I do not believe anything similar is implemented in Pyspark.
To clarify the behavior I am expecting, please see the below example rows:
user_id  col_1  col_2  avg_score
1        32     12     22
2        24     None   24
Below is my current implementation. Note that all values in col_1 are guaranteed to be non-null. I believe this can probably be further optimized:
from pyspark.sql import functions as f_
spark_df = spark_df.na.fill(0, 'col_2')
spark_df = spark_df.withColumn('avg_score',
    sum([spark_df[i] for i in ['col_1', 'col_2']]) /
    sum([f_.when(spark_df[i] > 0, 1).otherwise(0) for i in ['col_1', 'col_2']]))
If anyone has any suggestions for whether there is a more efficient way to calculate this I would really appreciate it.
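For reference, the null-ignoring behavior the question describes (the np.nanmean semantics) can be sketched in plain Python against the example rows above; the nan_mean helper is hypothetical, not a Spark API:

```python
def nan_mean(*values):
    """Average of the non-None values only, mirroring np.nanmean for this case."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

# The example rows from the question.
rows = [
    {'user_id': 1, 'col_1': 32, 'col_2': 12},
    {'user_id': 2, 'col_1': 24, 'col_2': None},
]
for row in rows:
    row['avg_score'] = nan_mean(row['col_1'], row['col_2'])

print([r['avg_score'] for r in rows])  # [22.0, 24.0]
```

This is only a specification of the expected output, not an efficient 4-billion-row implementation; it makes the divide-by-count-of-non-nulls intent of the PySpark snippet explicit.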

How to get the average of multiple columns with NULL in PostgreSQL

The AVG function in PostgreSQL ignores NULL values when it calculates the average. But what if I want to compute the average value of multiple columns that contain many NULL values?
None of the commands below work:
AVG(col1, col2, col3)
AVG(col1) + AVG(col2) + AVG(col3) -> adding the individual averages gives a wrong value because of the NULLs
This question is similar to this Average of multiple columns, but is there any simple solution for PostgreSQL specific case?
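One common SQL approach is to sum the non-NULL values and divide by a count of the non-NULL columns. A sketch, run against SQLite through Python's sqlite3 for convenience (the table and data are made up); note the hedge that in PostgreSQL each IS NOT NULL test would need a cast such as (col1 IS NOT NULL)::int, since PostgreSQL booleans are not integers:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE t (col1 REAL, col2 REAL, col3 REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(10, 20, None), (None, None, 30), (1, 2, 3)])

# Row-wise average ignoring NULLs: sum of the non-NULL values divided by
# the count of non-NULL columns.  NULLIF guards against all-NULL rows
# (it turns a zero count into NULL instead of a division error).
rows = conn.execute("""
    SELECT (COALESCE(col1, 0) + COALESCE(col2, 0) + COALESCE(col3, 0)) * 1.0
           / NULLIF((col1 IS NOT NULL) + (col2 IS NOT NULL) + (col3 IS NOT NULL), 0)
    FROM t
""").fetchall()
print([r[0] for r in rows])  # [15.0, 30.0, 2.0]
```

The COALESCE/NULLIF combination is standard SQL, so the same expression carries over to PostgreSQL once the boolean tests are cast to integers.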

Dividing AVG of column1 by AVG of column2

I am trying to divide the average value of column1 by the average value of column 2, which will give me an average price from my data. I believe there is a problem with my syntax / structure of my code, or I am making a rookie mistake.
I have searched stack and cannot find many examples of dividing two averaged columns, and checked the postgres documentation.
The individual average query is working fine (as shown here)
SELECT (AVG(CAST("Column1" AS numeric(4,2))),2) FROM table1
But when I combine two of them in an attempt to divide, it simply does not work.
SELECT (AVG(CAST("Column1" AS numeric(4,2))),2) / (AVG(CAST("Column2" AS numeric(4,2))),2) FROM table1
I am receiving the following error: "ERROR: row comparison operator must yield type boolean, not type numeric". I have tried a few other variations, which have mostly given me syntax errors.
I don't know what you are trying to do with your current approach. However, if you want to take the ratio of two averages, you could also just take the ratio of the sums:
SELECT SUM(CAST(Column1 AS numeric(4,2))) / SUM(CAST(Column2 AS numeric(4,2)))
FROM table1;
Note that SUM() just takes a single input, not two inputs. The reason why we can use the sums is that average would normalize both the numerator and denominator by the same amount, which is the number of rows in table1. Hence, this factor just cancels out.
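The cancellation argument above is easy to check numerically. A small sketch using SQLite via Python's sqlite3 (made-up table and values) shows AVG(a)/AVG(b) and SUM(a)/SUM(b) agreeing:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE table1 (Column1 REAL, Column2 REAL)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)",
                 [(10.0, 4.0), (30.0, 4.0)])

# Both expressions divide by the same row count, so the count cancels out.
ratio_of_avgs, ratio_of_sums = conn.execute("""
    SELECT AVG(Column1) / AVG(Column2),
           SUM(Column1) / SUM(Column2)
    FROM table1
""").fetchone()
print(ratio_of_avgs, ratio_of_sums)  # 5.0 5.0
```

With very large tables the sums can overflow where the averages would not, so the AVG form is sometimes safer in practice; for this question either works.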

head & tail function in Tableau

What is the most effective procedure or equivalent for "head" & "tail" in Tableau if you are trying to isolate the first or last items of a vector?
The known procedure, e.g. in R, is easy if you are looking for the average of the last 10 numbers:
mean(head(x, 10)) or mean(tail(x, 10)).
I tried to find a solution in Tableau with LOD "max" & "min", but "show me the first 10 min items"?! No chance!
If you are talking about table calculations, then you can use the WINDOW_XXX() functions -- such as window_sum(sum(x), 1, 10) or window_avg(sum(x), size()-9, size())
Table calcs operate on the aggregated query results returned from the data source, hence the aggregation function sum() around the field x.
It might be simpler to learn to use Top N filters instead of table calcs, depending on what problem you are trying to solve.
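Outside Tableau, the head/tail averaging being asked about is straightforward to sketch in plain Python (made-up data; list slicing plays the role of R's head()/tail() and of the window bounds in the table calcs above):

```python
x = list(range(1, 21))  # 1..20, a stand-in for the vector in the question

head_mean = sum(x[:10]) / 10   # like mean(head(x, 10)) in R, or
                               # window_avg(sum(x), 1, 10) in Tableau terms
tail_mean = sum(x[-10:]) / 10  # like mean(tail(x, 10)), or
                               # window_avg(sum(x), size()-9, size())

print(head_mean, tail_mean)  # 5.5 15.5
```

The point of the comparison is that Tableau's window functions address positions within the sorted, aggregated partition, which is exactly what head/tail slicing does to a vector.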

PgSQL - Error while executing a select

I am trying to write a simple select query in PgSQL but I keep getting an error. I am not sure what I am missing. Any help would be appreciated.
select residuals, residuals/stddev_pop(residuals)
from mySchema.results;
This gives an error
ERROR: column "results.residuals" must appear in the GROUP BY clause or be used in an aggregate function
Residuals is a numeric value (continuous variable)
What am I missing?
stddev_pop is an aggregate function. That means that it takes a set of rows as its input. Your query mentions two values in the SELECT clause:
residuals, this is a value from a single row.
stddev_pop(residuals), this is an aggregate value and represents multiple rows.
You're not telling PostgreSQL how it should choose the singular residuals value to go with the aggregate standard deviation and so PostgreSQL says that residuals
must appear in the GROUP BY clause or be used in an aggregate function
I'm not sure what you're trying to accomplish so I can't tell you how to fix your query. A naive suggestion would be:
select residuals, residuals/stddev_pop(residuals)
from mySchema.results
group by residuals
but that would leave you computing the standard deviation of groups of identical values and that doesn't seem terribly productive (especially when you're going to use the standard deviation as a divisor).
Perhaps you need to revisit the formula you're trying to compute as well as fixing your SQL.
If you want to compute the standard deviation separately and then divide each residuals by that then you'd want something like this:
select residuals,
residuals/(select stddev_pop(residuals) from mySchema.results)
from mySchema.results
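The scalar-subquery version computes one standard deviation for the whole table and then divides every row by it. A plain-Python sketch of the same computation (hypothetical residuals, not the asker's data) makes the two-step shape explicit:

```python
import statistics

residuals = [2.0, 6.0, 2.0, 6.0]  # made-up stand-ins for mySchema.results

# One population standard deviation over the whole set, like stddev_pop().
sd = statistics.pstdev(residuals)

# Then each residual is divided by that single scalar.
standardized = [r / sd for r in residuals]
print(sd, standardized)  # 2.0 [1.0, 3.0, 1.0, 3.0]
```

This mirrors why the GROUP BY residuals version is unproductive: grouping by the value itself gives each group a standard deviation of zero, making the division undefined.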