Systematic sampling in PySpark

I’m quite new to PySpark and I’ve been struggling to find the answer I’m looking for.
I have a large sample of households and I want to conduct systematic sampling. Like true systematic sampling, I would like to begin at a random starting point and then select a household at regular intervals (e.g. every 50th household). I have looked into sample() and sampleBy(), but I don't think these are quite what I need. Can anyone give any advice on how I can do this? Many thanks in advance for your help!

monotonically_increasing_id only produces consecutive IDs when the DataFrame has a single partition, so if you have more than one partition, consider row_number instead.
See the "Notes" section in https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html
With row_number:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# random_start is your randomly chosen offset (e.g. between 0 and 49)
df = (df.withColumn("index", F.row_number().over(Window.orderBy("somecol")))
        .filter(((F.col("index") + random_start) % 50) == 0))

You might want to use monotonically_increasing_id then modulo by 50 to get what you want.
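A minimal sketch of that approach (assuming random_start holds the randomly chosen offset; note that monotonically_increasing_id only yields consecutive IDs within a single partition, so the DataFrame is coalesced to one partition here to keep the interval exact):

from pyspark.sql import functions as F

random_start = 7  # hypothetical random offset between 0 and 49
sampled = (df.coalesce(1)
             .withColumn("idx", F.monotonically_increasing_id())
             .filter(((F.col("idx") + random_start) % 50) == 0))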

Related

How can I optimize calculation of mean in pyspark while ignoring Null values

I have a Pyspark Dataframe with around 4 billion rows, so efficiency in operations is very important. What I want to do seems very simple. I want to calculate the average value from two columns, and if one of them is Null I want to only return the non-null value. In Python I could easily accomplish this using np.nanmean, but I do not believe anything similar is implemented in Pyspark.
To clarify the behavior I am expecting, please see the below example rows:
user_id  col_1  col_2  avg_score
1        32     12     22
2        24     None   24
Below is my current implementation. Note that all values in col_1 are guaranteed to be non-null. I believe this can probably be further optimized:
from pyspark.sql import functions as f_
spark_df = spark_df.na.fill(0, 'col_2')
spark_df = spark_df.withColumn('avg_score',
    sum([spark_df[i] for i in ['col_1', 'col_2']]) /
    sum([f_.when(spark_df[i] > 0, 1).otherwise(0) for i in ['col_1', 'col_2']]))
If anyone has suggestions for a more efficient way to calculate this, I would really appreciate it.
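For reference, a sketch of one alternative (not from the original thread) that skips the na.fill step and averages only the non-null values, reusing the column names from the question:

from pyspark.sql import functions as F

# col_1 is guaranteed non-null (per the question); col_2 may be null
spark_df = spark_df.withColumn(
    'avg_score',
    (F.col('col_1') + F.coalesce(F.col('col_2'), F.lit(0))) /
    (F.lit(1) + F.when(F.col('col_2').isNotNull(), 1).otherwise(0)))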

Calculating median of column "Balance" from table "Marketing"

I have a Spark (Scala) dataframe "Marketing" with approximately 17 columns, one of which is "Balance". The data type of this column is Int. I need to find the median Balance. I can get as far as arranging it in ascending order, but how do I proceed after that? I have been given a hint that the percentile function can be used, but I don't have any idea about this percentile function. Can anyone help?
Median is the same thing as the 50th percentile. If you do not mind using Hive functions, you can do one of the following:
marketingDF.selectExpr("percentile(CAST(Balance AS BIGINT), 0.5) AS median")
If you do not need an exact figure you can look into using percentile_approx() instead.
Documentation for both functions can be found in the Hive built-in functions (UDF) reference.
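For example, a sketch of the approximate version (the expression string is the same whether you call it from Scala or PySpark; percentile_approx also accepts an optional accuracy argument):
marketingDF.selectExpr("percentile_approx(Balance, 0.5) AS approx_median")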

Removing non-matching dates from two time series in matlab

I have two time series x and y which roughly cover the same period of time. The data is in daily form; however, there are some days that have data in one dataset but no data in the other. I wish to use MATLAB to create two datasets of equal size with matching dates. Essentially I wish to remove the days that don't have data in both x and y. Is there a simple way to do this? Thanks.
You could use an inner join (see help join) if you are able to convert your time series into datasets. If not, you could use the ismember function, but applied only to the dates.
Something like this will work:
a = {'2015-01-01', '2015-02-02', '2015-03-03'};   % dates present in the first series
b = {'2015-01-01', '2015-03-03', '2015-04-04'};   % dates present in the second series
newA = a(ismember(a,b));   % keep only the dates that also appear in b
newB = b(ismember(b,a));   % keep only the dates that also appear in a

Find the maximum of every 60 elements in my data - MATLAB

I have a vector that contains a long list of data (a time series). I would like to find the maximum of every 60 elements without manually writing C = [max(B(1:60)), etc...] because it is a rather large data set. Is there a clean way of doing this? Thanks for any ideas! I appreciate it.
Oli's suggestion deserves to be made into a formal answer. Try this:
C = max(reshape(B,60,[]));
As another option you can look at blkproc.
A= randn(600,1);
blkproc( A, [60,1], 'max');
blkproc is being phased out, so you may also want to look at blockproc.
Though, reshaping and taking the max will probably be more efficient as was mentioned in the comments.
max( reshape(A, [60, 10] ) )
[update]
As a note... don't use blkproc :-). With a very large array A, blkproc is about 100x slower than the max/reshape approach.
You can also use the buffer function.
A= randn(600,1);
max(buffer(A,60));
This solution works even when the length of the vector is not an exact multiple of 60, and is faster than the reshape approach.

Weighted Average Fields

I'm totally new to doing calculations in T-SQL. I guess I'm wondering what a weighted average is and how you do it in T-SQL for a field?
First off, as far as I know, a weighted average is simply multiplying 2 columns and then averaging by dividing by something.
Here's an example of a calculated field I have in my view, after calling one of our UDFs. Now this field in my view also needs to be a weighted average... no idea where to start to turn this into a weighted average.
So ultimately this UDF returns the AverageCostG. I call the UDF from my view and so here's the guts of the UDF:
Set @AverageCostG = ((@AvgFullYear_Rent * @Months) +
                     (@PrevYearRent * @PrevYearMonths))
                     / @Term
so in my view I'm calling the UDF above to get back that @AverageCostG
CREATE View MyTestView
AS
select v.*, --TODO: get rid of *, that's just for testing, select actual field names
CalculateAvgRentG(d.GrossNet, d.BaseMonthlyRent, d.ILI, d.Increase, d.Term) as AverageRent_G,
....
from SomeOtherView v
Now I need to make this AverageRent_G calc field in my view also a weighted average somehow...
Do I need to know WHAT they want weighted, or is it assumed to be obvious? I don't know what I need to know in order to do the weighted average for them, i.e. what specs I need from them, if any, beyond this calculation I've created based off the UDF call. Do I have to do some crazy select/join in addition to multiplying 2 fields and dividing by something to average it? How do I know which fields are to be used in the weighted average, and from where? I will openly admit I'm totally new to BI T-SQL development, as I'm an ASP.NET MVC C#/architect dev, and lost with this calculation stuff in T-SQL.
I have tried to research this but just need some basic hand-holding the first time through; my head hurts right now because I don't know what info I need to obtain from them and then what exactly to do to make that calc field weighted.
They'll have to tell you what the weighting factor is. This is my guess:
SUM([weight] * CalculateAvgRentG(...)) / SUM([weight])
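As a hypothetical illustration with invented numbers, taking the number of months as the weight: rents of 1000 for 9 months and 1200 for 3 months give a weighted average of (9*1000 + 3*1200) / (9 + 3) = 12600 / 12 = 1050, whereas the unweighted average would be (1000 + 1200) / 2 = 1100.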