Pyspark - filter to select max value

I have a date column with column type "string". It has multiple dates and several rows of data for each date.
I'd like to filter the data to select the most recent (max) date, but when I run it, the code executes and ends up returning an empty table.
Currently I am typing in my desired date manually, because I am under the impression that no form of max function will work since the column is of string type.
This is the code I am using:
extract = raw_table.drop_duplicates() \
    .filter(raw.as_of_date == '2022-11-25')
What I'd like to do is automate this, something along the lines of:
    .filter(raw.as_of_date == max(as_of_date))
Please advise on how to convert the column type from string to date, how to select the max date, and why my hardcoded filter results in an empty table.

You'll need to calculate the max of the date in a column and then use that column for filtering, as Spark does not allow aggregations inside filters.
You might need something like the following:
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# an empty partitionBy() makes the window span the whole dataframe,
# so max('as_of_date') is the overall latest date
raw_table.drop_duplicates(). \
    withColumn('max_date', func.max('as_of_date').over(wd.partitionBy())). \
    filter(func.col('as_of_date') == func.col('max_date')). \
    drop('max_date')
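If you also want to convert the string column to a proper date type first (and compute the max with an aggregation instead of a window), here is a minimal sketch, assuming the strings are in yyyy-MM-dd format like '2022-11-25':
import pyspark.sql.functions as func

# cast the string column to DateType; the format string is an assumption
dated = raw_table.drop_duplicates() \
    .withColumn('as_of_date', func.to_date('as_of_date', 'yyyy-MM-dd'))

# pull the max date back to the driver as a plain Python value, then filter on it
max_date = dated.agg(func.max('as_of_date')).collect()[0][0]
extract = dated.filter(func.col('as_of_date') == max_date)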

Related

Pyspark filter a dataframe based on another dataframe containing distinct values

I have a df1 based on distinct values containing two columns, date and value. There is a df2 that has multiple columns but also contains the date and value columns. For each distinct value from df1, I want to filter df2 such that the records before the date from df1 are dropped. It would be rather easy for a single distinct value: I can filter by value and then use gt(lit(date)). However, I have over 500 such distinct pairs in df1. A single operation takes around 20 minutes, so a loop is computationally not feasible. Perhaps somebody can advise me on a better methodology here.
I have tried multiple methodologies but nothing has worked yet.
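A common way to avoid looping over the 500 pairs is to join df2 to df1 on value and do the date comparison in a single pass. A rough sketch, assuming both dataframes have columns literally named value and date:
from pyspark.sql import functions as F

# one join instead of 500 separate filters; keep only df2 rows on or after df1's date
result = df2.join(
    df1.select("value", F.col("date").alias("min_date")),
    on="value",
    how="inner",
).filter(F.col("date") >= F.col("min_date")).drop("min_date")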

How to get the numeric value of missing values in a PySpark column?

I am working with the OpenFoodFacts dataset using PySpark. There are quite a lot of columns that are entirely made up of missing values, and I want to drop those columns. I have been looking up ways to retrieve the number of missing values in each column, but they are displayed in a table format instead of actually giving me the numeric value of the total null values.
The following code shows the number of missing values in a column but displays it in a table format:
from pyspark.sql.functions import col, isnan, when, count
data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).show()
I have tried the following codes:
This one does not work as intended, as it doesn't drop any columns (as expected):
for c in data.columns:
    if(data.select([count(when(isnan(c) | col(c).isNull(), c))]) == data.count()):
        data = data.drop(c)
data.show()
This one I am currently trying but takes ages to execute
for c in data.columns:
    if(data.filter(data[c].isNull()).count() == data.count()):
        data = data.drop(c)
data.show()
Is there a way to get ONLY the number? Thanks
If you need the number instead of showing it in table format, you need to use .collect():
list_of_values = data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).collect()
What you get back is a list of Row objects, which contain all the information in the table.
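To get just the number, one way is to index into the first Row of that list, for example after aliasing the aggregated column:
missing_count = data.select(
    count(when(isnan("column") | col("column").isNull(), "column")).alias("missing")
).collect()[0][0]  # equivalently .collect()[0]["missing"]
print(missing_count)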

How to populate a Spark DataFrame column based on another column's value?

I have a use case where I need to select certain columns from a dataframe containing at least 30 columns and millions of rows.
I'm loading this data from a cassandra table using scala and apache-spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data according to src_ip,src_port,dst_ip,dst_port and I also want to have the latest value from a received_time column of the original dataframe.
I want a dataframe with distinct src_ip values with their count and latest received_time in a new column as last_seen.
I know how to use .withColumn, and I think .map() could also be used here.
Since I'm relatively new in this field, I really don't know how to proceed further. I could really use your help to get done with this task.
Assuming you have a dataframe df with src_ip, src_port, dst_ip, dst_port and received_time, you can try:
val mydf = df
  .groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(
    count("received_time").as("row_count"),
    max(col("received_time")).as("max_received_time")
  )
This calculates the count of received_time values for each group-by key, as well as the max timestamp for that group.
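Since the rest of this thread is PySpark, here is a rough PySpark equivalent, assuming the same column names:
from pyspark.sql import functions as F

mydf = df.groupBy("src_ip", "src_port", "dst_ip", "dst_port").agg(
    F.count("received_time").alias("row_count"),
    F.max("received_time").alias("last_seen"),  # the desired last_seen column
)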

Map a text file to key/value pair in order to group them in pyspark

I would like to create a Spark dataframe in PySpark from a text file that has a varying number of rows and columns, and map it to a key/value pair, where the key is the first 4 characters of the first column of the text file. I want to do that in order to remove the redundant rows and to be able to group them later by the key value. I know how to do that in pandas but am still confused about where to start in PySpark.
My input is a text file that has the following:
1234567,micheal,male,usa
891011,sara,femal,germany
I want to be able to group every row by the first six characters in the first column
Create a new column that contains only the first six characters of the first column, and then group by it:
from pyspark.sql.functions import col

# substr(1, 6) takes the first six characters (positions are 1-based)
df2 = df.withColumn("key", col("first_col").substr(1, 6))
df2.groupBy("key").agg(...)

How to fit a kernel density estimate on a pyspark dataframe column and use it for creating a new column with the estimates

My use case is the following. Consider I have a pyspark dataframe which has the following format:
df.columns:
1. hh: Contains the hour of the day (type int)
2. userId: some unique identifier.
What I want to do is figure out the list of userIds which have anomalous hits on the page. So I first do a groupby like so:
df = df.groupby("hh", "userId").count().alias("LoginCounts")
Now the format of the dataframe would be:
1. hh
2. userId
3. LoginCounts: Number of times a specific user logs in at a particular hour.
I want to use the pyspark kde function as follows:
from pyspark.mllib.stat import KernelDensity
kd=KernelDensity()
kd.setSample(df.select("LoginCounts").rdd)
kd.estimate([13.0, 14.0])
I get the error:
Py4JJavaError: An error occurred while calling o647.estimateKernelDensity.
: org.apache.spark.SparkException: Job aborted due to stage failure
Now my end goal is to fit a kde on say a day's hour based data and then use the next day's data to get the probability estimates for each login count.
Eg: I would like to achieve something of this nature:
df.withColumn("kdeProbs",kde.estimate(col("LoginCounts)))
So the column kdeProbs will contain P(LoginCount=x | estimated kde).
I have tried searching for an example of the same but am always redirected to the standard kde example on the spark.apache.org page, which does not solve my case.
Selecting one column and converting it to an RDD gives you an RDD of Row objects; you also need to pull the actual value out of each Row for it to work. Try this:
from pyspark.mllib.stat import KernelDensity

dat_rdd = df.select("LoginCounts").rdd

# extract the value from each Row (and cast to float, since KernelDensity expects doubles)
dat_rdd_data = dat_rdd.map(lambda x: float(x[0]))

kd = KernelDensity()
kd.setSample(dat_rdd_data)
kd.estimate([13.0, 14.0])
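To get the estimates back into the dataframe as the desired kdeProbs column, one rough approach is to evaluate the density at each distinct LoginCounts value and join the results back. A sketch, assuming the grouped dataframe above and an active spark session:
from pyspark.mllib.stat import KernelDensity

# distinct login counts as plain Python floats
counts = [float(r[0]) for r in df.select("LoginCounts").distinct().collect()]

kd = KernelDensity()
kd.setSample(df.select("LoginCounts").rdd.map(lambda x: float(x[0])))
probs = kd.estimate(counts)  # numpy array, one density estimate per count

# small lookup table of (LoginCounts, kdeProbs), joined back onto df
lookup = spark.createDataFrame(
    [(c, float(p)) for c, p in zip(counts, probs)],
    ["LoginCounts", "kdeProbs"],
)
result = df.join(lookup, on="LoginCounts", how="left")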