How to fit a kernel density estimate on a pyspark dataframe column and use it for creating a new column with the estimates - pyspark

My use case is the following. Consider that I have a pyspark dataframe which has the following format:
df.columns:
1. hh: Contains the hour of the day (type int)
2. userId: some unique identifier.
What I want to do is figure out the list of userIds which have anomalous hits onto the page. So I first do a groupby like so:
df = df.groupby("hh", "userId").count().withColumnRenamed("count", "LoginCounts")
Now the format of the dataframe would be:
1. hh
2. userId
3. LoginCounts: Number of times a specific user logs in at a particular hour.
I want to use the pyspark kde function as follows:
from pyspark.mllib.stat import KernelDensity
kd=KernelDensity()
kd.setSample(df.select("LoginCounts").rdd)
kd.estimate([13.0, 14.0])
I get the error:
Py4JJavaError: An error occurred while calling o647.estimateKernelDensity.
: org.apache.spark.SparkException: Job aborted due to stage failure
Now my end goal is to fit a KDE on, say, one day's hourly data and then use the next day's data to get the probability estimates for each login count.
Eg: I would like to achieve something of this nature:
df.withColumn("kdeProbs",kde.estimate(col("LoginCounts)))
So the column kdeProbs will contain P(LoginCount=x | estimated kde).
I have tried searching for an example of this, but I am always redirected to the standard KDE example on the spark.apache.org page, which does not solve my case.

It's not enough to just select one column and convert it to an RDD: that gives you an RDD of Row objects, while setSample expects an RDD of plain numeric values, so you also need to extract the value from each Row. Try this:
from pyspark.mllib.stat import KernelDensity
dat_rdd = df.select("LoginCounts").rdd
# extract the numeric value from each Row
dat_rdd_data = dat_rdd.map(lambda x: float(x[0]))
kd = KernelDensity()
kd.setSample(dat_rdd_data)
kd.estimate([13.0,14.0])
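If the end goal is a kdeProbs column, one option (a minimal sketch rather than a built-in API, since kde.estimate takes a Python list on the driver and cannot be called inside withColumn) is to evaluate the density at each distinct LoginCounts value and join the result back as a lookup table. Variable names and the bandwidth value are illustrative assumptions:
from pyspark.mllib.stat import KernelDensity

# Fit the KDE on the numeric sample (e.g. yesterday's counts)
kd = KernelDensity()
kd.setSample(df.select("LoginCounts").rdd.map(lambda row: float(row[0])))
kd.setBandwidth(3.0)  # assumed bandwidth; tune for your data

# Evaluate the density at every distinct LoginCounts value (e.g. today's counts)
rows = df.select("LoginCounts").distinct().collect()
points = [float(r["LoginCounts"]) for r in rows]
estimates = kd.estimate(points)

# Build a small lookup DataFrame and join it back to get the kdeProbs column
lookup = spark.createDataFrame(
    [(r["LoginCounts"], float(e)) for r, e in zip(rows, estimates)],
    ["LoginCounts", "kdeProbs"])
df_with_probs = df.join(lookup, on="LoginCounts", how="left")
This only collects the distinct count values to the driver, which should stay small since LoginCounts is bounded by logins per hour.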

Related

reduce function not working in derived column in adf mapping data flow

I am trying to create the derived column based on the condition that matches the value, and trying to do the summation of multiple matching column values dynamically. So I am using the reduce function in the ADF derived column mapping data flow. But the column is not getting created even though the transformation is correct.
Columns from source
Derived column logic
Derived column data preview without the new columns as per logic
I can see only the fields from the source but not the derived column fields. If I use only array($$), I can see the fields getting created.
Derived column data preview with logic only array($$)
How to get the derived column with the summation of all the fields matching the condition?
We are getting 48 weeks of forecast data, and the data has to be prepared on a monthly basis.
eg: Input data
Output data:
JAN
----
506 -- This is for the first record, i.e. (94 + 105 + 109 + 103 + 95)
The problem is that the array($$) in the reduce function has only one element, so the reduce function cannot accumulate the contents of the matching columns correctly.
You can solve this by using two derived columns and a data flow parameter as follows:
Create derived columns with pattern matching for each month-week as you did before, but put the reference $$ into the value field instead of the reduce(...) function.
This will create derived columns like 0jan, 1jan, etc., containing a copy of the original values. For example, Week 0 (1 Jan - 7 Jan) => 0jan with value 95.
This step gives you a predefined set of column names for each week, which you can use to summarize the values with specific column names.
Define Data Flow parameters for each month containing the month-week column names in a string array, like this:
ColNamesJan = ['0jan', '1jan', etc.], ColNamesFeb = ['0feb', '1feb', etc.], and so on.
You will use these column names in a reduce function to summarize the month-week columns into a monthly column in the next step.
Create a derived column for each month, which will contain the monthly totals, and use the following reduce function to sum the weekly values:
reduce(array(byNames($ColNamesJan)), 0, #acc + toInteger(toString(#item)),#result)
Replace the parameter name accordingly.
I was able to summarize the columns dynamically with the above solution.
Please let me know if you need more information (e.g. screenshots) to reproduce the solution.
Update -- Here are the screenshots from my test environment.
Data source (data preview):
Derived columns with pattern matching (settings)
Derived columns with pattern matching (data preview)
Data flow parameter:
Derived column for monthly sum (settings):
Derived column for monthly sum (data preview):

Pyspark - filter to select max value

I have a date column with column type "string". It has multiple dates and several rows of data for each date.
I'd like to filter the data to select the most recent (max) date; however, when I run it, the code runs but ends up populating an empty table.
Currently I am typing in my desired date manually because I am under the impression that no form of max function will work since the column is of string type.
This is the code I am using
extract = raw_table.drop_duplicates() \
    .filter(raw_table.as_of_date == '2022-11-25')
and what I desire to do is make this automated, something along the lines of
.filter(raw_table.as_of_date == max(as_of_date))
Please advise on how to convert the column type from string to date, how to select the max date, and why my hardcoded filter results in an empty table.
You'll need to calculate the max of the date in a column and then use that column for filtering, as Spark does not allow aggregations in filters.
You might need something like the following:
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd
raw_table.drop_duplicates(). \
withColumn('max_date', func.max('as_of_date').over(wd.partitionBy())). \
filter(func.col('as_of_date') == func.col('max_date')). \
drop('max_date')
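If you also want the column converted from string to a proper date type, here is a minimal sketch (assuming the strings are in yyyy-MM-dd format, as in '2022-11-25'; adjust the format string otherwise) that converts the column and then filters on the computed max without a window:
import pyspark.sql.functions as func

# Convert the string column to DateType (the format string is an assumption)
typed = raw_table.drop_duplicates() \
    .withColumn('as_of_date', func.to_date('as_of_date', 'yyyy-MM-dd'))

# Compute the max date once, then use it in the filter
max_date = typed.agg(func.max('as_of_date').alias('max_date')).collect()[0]['max_date']
extract = typed.filter(func.col('as_of_date') == func.lit(max_date))
This collects a single scalar to the driver, which avoids the window but triggers one extra job; the window version above keeps everything in one plan.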

How to get the numeric value of missing values in a PySpark column?

I am working with the OpenFoodFacts dataset using PySpark. There are quite a lot of columns which are entirely made up of missing values, and I want to drop those columns. I have been looking up ways to retrieve the number of missing values in each column, but they are displayed in a table format instead of actually giving me the numeric value of the total null values.
The following code shows the number of missing values in a column but displays it in a table format:
from pyspark.sql.functions import col, isnan, when, count
data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).show()
I have tried the following codes:
This one does not work as intended, as it doesn't drop any columns (as expected):
for c in data.columns:
    if data.select([count(when(isnan(c) | col(c).isNull(), c))]) == data.count():
        data = data.drop(c)
data.show()
This one I am currently trying, but it takes ages to execute:
for c in data.columns:
    if data.filter(data[c].isNull()).count() == data.count():
        data = data.drop(c)
data.show()
Is there a way to get ONLY the number? Thanks
If you need the number instead of showing it in table format, you need to use .collect(), like this:
list_of_values = data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).collect()
What you get is a list of Row objects, which contain all the information from the table.
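Building on that, here is a hedged sketch that computes the null counts for all columns in a single pass and then drops the fully-null ones. Note that isnan only applies to numeric columns, so for string columns you may need to drop the isnan part and keep only isNull:
from pyspark.sql.functions import col, count, isnan, when

# One job that counts nulls/NaNs per column; the result is a single Row
null_counts = data.select([
    count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in data.columns
]).collect()[0]

total = data.count()
# null_counts[c] is the plain number for column c; drop the entirely-null columns
all_null_cols = [c for c in data.columns if null_counts[c] == total]
data = data.drop(*all_null_cols)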

Inconsistent count after window lead function, and filter

Edit 2:
I've reported this as an issue to the Spark developers; I will post the status here when I get an update.
I have a problem that has been bothering me for quite some time now.
Imagine you have a dataframe with several million records, with these columns:
df1:
start(timestamp)
user_id(int)
type(string)
I need to compute the duration between two consecutive rows, and filter on that duration and type.
I used the window lead function to get the next event time (which defines the end of the current event), so every row now gets start and stop times.
If it is NULL (the last row, for example), add the next midnight as the stop.
The data is stored in an ORC file (I tried the Parquet format, no difference).
This only happens on a cluster with multiple executor nodes, for example an AWS EMR cluster or a local Docker cluster setup.
If I run it on a single instance (locally on my laptop), I get consistent results every time.
The Spark version is 3.0.1, both in the AWS and the local Docker setup.
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy("user_id").orderBy("start")
val ts_lead = coalesce(lead("start", 1).over(w), date_add(col("start"), 1))
val df2 = df1.
withColumn("end", ts_lead).
withColumn("duration", col("end").cast("long")-col("start").cast("long"))
df2.where("type='B' and duration>4").count()
Every time I run this last count, I get different results.
For example:
run 1: 19359949
run 2: 19359964
If I run every filter separately, everything is OK and I get consistent results.
But if I combine them, the results are inconsistent.
I tried filtering into a separate dataframe, first by duration then by type and vice versa; no joy there either.
I know that I can cache or checkpoint the dataframe, but it's a very large dataset and I have similar calculations multiple times, so I can't really spare the time and disk space for checkpoints and caching.
Is this a bug in Spark, or am I missing something?
Edit:
I have created sample code with dummy random data, so anyone can try to reproduce.
Since this sample uses random numbers, it's necessary to write the dataset after generation and re-read it.
I use a for loop to generate the set because when I tried to generate 25,000,000 rows in one pass, I got an out of memory error.
I saved it to an S3 bucket; here it's masked as your-bucket.
import org.apache.spark.sql.expressions.Window
val getRandomUser = udf(()=>{
val users = Seq("John","Eve","Anna","Martin","Joe","Steve","Katy")
users(scala.util.Random.nextInt(7))
})
val getRandomType = udf(()=>{
val types = Seq("TypeA","TypeB","TypeC","TypeD","TypeE")
types(scala.util.Random.nextInt(5))
})
val getRandomStart = udf((x:Int)=>{
x+scala.util.Random.nextInt(47)
})
for( a <- 0 to 23){
// use iterator a to continue with next million, repeat 1 mil times
val x=Range(a*1000000,(a*1000000)+1000000).toDF("id").
withColumn("start",getRandomStart(col("id"))).
withColumn("user",getRandomUser()).
withColumn("type",getRandomType()).
drop("id")
x.write.mode("append").orc("s3://your-bucket/random.orc")
}
val w = Window.partitionBy("user").orderBy("start")
val ts_lead = coalesce(lead("start", 1).over(w), lit(30000000))
val fox2 = spark.read.orc("s3://your-bucket/random.orc").
withColumn("end", ts_lead).
withColumn("duration", col("end")-col("start"))
// repeated executions of this line returns different results for count
fox2.where("type='TypeA' and duration>4").count()
My results for three consecutive runs of the last line were:
run 1: 2551259
run 2: 2550756
run 3: 2551279
Every run gives a different count.
I have reproduced your issue locally.
As far as I understand, the issue is that you are filtering by duration in this statement:
fox2.where("type='TypeA' and duration>4").count()
and duration is generated randomly. I understand that you are using a seed, but if you parallelise that, you do not know which random value will be added to each id.
For example, if 4 generated numbers were 21, 14, 5, 17, and the ids were 1, 2, 3, 4, the start column sometimes could be:
1 + 21
2 + 14
3 + 5
4 + 17
and sometimes could be:
1 + 21
4 + 14
3 + 5
2 + 17
This will lead to different start values, and hence different duration values, ultimately changing the final filter and count, because row order in dataframes is not guaranteed when running in parallel.

Create sample value for failure records spark

I have a scenario where my dataframe has 3 columns: a, b and c. I need to validate whether the length of all the columns is equal to 100. Based on the validation I am creating status columns like a_status, b_status, c_status with values 5 (Success) and 10 (Failure). In failure scenarios I need to update the count and create new columns a_sample, b_sample, c_sample with 5 failure sample values separated by ",". For creating the sample columns I tried this:
df = df.select(df.columns.toList.map(col(_)) :::
  df.columns.toList.map(x =>
    lit(getSample(df.select(x, x + "_status").filter(x + "_status=10").select(x).take(5))).alias(x + "_sample")
  ): _*)
The getSample method just takes an array of rows and concatenates them into a string. This works fine for a limited number of columns and a limited data size. However, if the number of columns is > 200 and the data is > 1 million rows, it creates a huge performance impact. Is there any alternative approach for this?
While the details of your problem statement are unclear, you can break up the task into two parts:
Transform data into a format where you identify several different types of rows you need to sample.
Collect sample by row type.
The industry jargon for "row type" is stratum/strata. The way to do (2) without collecting data to the driver (which you don't want to do when the data is large) is via stratified sampling, which Spark implements via df.stat.sampleBy(). As a statistical function, it works with fractions rather than exact row counts. If you absolutely must get a sample with an exact number of rows, there are two strategies:
Oversample by fraction and then filter unneeded rows, e.g., using the row_number() window function followed by a filter 'row_num < n.
Build a custom user-defined aggregate function (UDAF), firstN(col, n). This will be much faster but a lot more work. See https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
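For illustration, here is a minimal PySpark sketch of the first strategy (oversample with sampleBy, then trim with row_number), not a definitive implementation; the names df and "a_status", the failure value 10, the fraction, and n = 5 are assumptions based on the question:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

n = 5
# Oversample the failure stratum (the fraction is a guess; tune it so you rarely
# end up with fewer than n rows), then trim to exactly n rows per stratum.
sampled = df.stat.sampleBy("a_status", fractions={10: 0.01}, seed=42)

w = Window.partitionBy("a_status").orderBy(F.rand(seed=42))
a_samples = (sampled
             .withColumn("row_num", F.row_number().over(w))
             .filter(F.col("row_num") <= n)
             .drop("row_num"))
The same pattern would be repeated per column (a, b, c), which is the multi-pass approach discussed in the next paragraph.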
An additional challenge for your use case is that you want this done per column. This is not a good fit with Spark's transformations such as grouping or sampleBy, which operate on rows. The simple approach is to make several passes through the data, one column at a time. If you absolutely must do this in a single pass through the data, you'll need to build a much more custom UDAF or Aggregator, e.g., the equivalent of takeFirstNFromAWhereBHasValueC(n, colA, colB, c).