PySpark: filter a dataframe based on another dataframe containing distinct values

I have a df1 based on distinct values, containing two columns, date and value. There is a df2 that has multiple columns but contains the date and value columns. For each distinct value from df1, I want to filter df2 such that the records before the corresponding date from df1 are dropped. It would be rather easy for a single distinct value: I can filter by value and then use gt(lit(date)). However, I have over 500 such distinct pairs in df1. A single operation takes around 20 minutes, so using a loop is computationally not feasible. Perhaps somebody can advise me on a better methodology here.
I have tried multiple methodologies but nothing has worked yet.
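For reference, the single-pair version described above looks roughly like this (the column names and the example pair are placeholders):

from pyspark.sql.functions import col, lit

# one (value, date) pair taken from df1 -- placeholder values
some_value = "A"
some_date = "2023-01-01"

# keep only the df2 records that are not before the df1 date for this value
filtered = df2.filter((col("value") == some_value) & (col("date") >= lit(some_date)))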

Related

PySpark: How to count the number of distinct values from two columns?

I have a DataFrame with two columns, id1 and id2, and what I'd like to get is the count of distinct values across these two columns. Essentially this is count(set(id1 + id2)).
How can I do that with PySpark?
Thanks!
Please note that this isn't a duplicate, as I'd like PySpark to calculate the count(). Of course it's possible to get the two lists id1_distinct and id2_distinct and put them in a set(), but that doesn't seem like the proper solution when dealing with big data, and it's not really in the PySpark spirit.
You can combine the two columns into one using union, and get the countDistinct:
import pyspark.sql.functions as F
cnt = df.select('id1').union(df.select('id2')).select(F.countDistinct('id1')).head()[0]
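For example, on a toy DataFrame (assuming an active SparkSession named spark):

import pyspark.sql.functions as F

df = spark.createDataFrame([(1, 2), (2, 3), (3, 3)], ["id1", "id2"])

# union keeps the first DataFrame's column name, so countDistinct is applied to 'id1'
cnt = df.select('id1').union(df.select('id2')).select(F.countDistinct('id1')).head()[0]
print(cnt)  # 3 -- the distinct values across both columns are 1, 2 and 3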

How to populate a Spark DataFrame column based on another column's value?

I have a use-case where I need to select certain columns from a dataframe containing at least 30 columns and millions of rows.
I'm loading this data from a Cassandra table using Scala and Apache Spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data according to src_ip,src_port,dst_ip,dst_port and I also want to have the latest value from a received_time column of the original dataframe.
I want a dataframe with distinct src_ip values with their count and latest received_time in a new column as last_seen.
I know how to use .withColumn and also, I think that .map() can be used here.
Since I'm relatively new in this field, I really don't know how to proceed further. I could really use your help to get done with this task.
Assuming you have a dataframe df with src_ip, src_port, dst_ip, dst_port and received_time, you can try:
import org.apache.spark.sql.functions._

// count of received_time values and the latest received_time per group
val mydf = df
  .groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(
    count("received_time").as("row_count"),
    max(col("received_time")).as("max_received_time")
  )
The above calculates the count of received timestamps for each combination of the groupBy columns, as well as the max timestamp for that combination.

Multilevel sorting of column data (string datatype) in Spark SQL

For example, let's consider three columns, namely empid, year of joining and designation, in an employee table. Initially the table is sorted by joining date. If two people have joined on the same date, then the table must have the higher-priority designation on top and the lesser ones below.
How do I assign a priority to the designation for the already sorted data in a Spark SQL dataframe?
For example, if a CEO and a project manager have joined a company on the same date, then the CEO details must be on top of the project manager when the dataframe is viewed.
Let's say you have a DataFrame with 3 columns: empid, joiningYear, designation. Then you can do something like this to sort on multiple columns:
import org.apache.spark.sql.functions._
val sorted = df.sort(asc("joiningYear"), asc("designation"))
In this case, data will first be sorted on joiningYear and, for people with the same joiningYear, it will then be sorted by designation.
Consider that there are three columns, namely empid, year_of_joining and designation, in the employee table. You can sort or orderBy both columns, but how will you order based on designation if it contains e.g. "CEO" and "projectManager"? Using designation within sort() will sort it in alphabetical order. So you need some increasing numbers denoting the designation based on seniority, and then you can simply use the code below.
import org.apache.spark.sql.functions._
val sortedEmp = df.sort(asc("year_of_joining"), desc("designation"))
Since you want the higher-priority designation on top and the lesser ones below, you should use desc for designation.
Since there won't be many designations, you can assign increasing numbers to these designations based on seniority.
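One way to attach such a seniority number, sketched here in PySpark (an active SparkSession named spark and an employee DataFrame called employees are assumed, and the designation names and rank values are made up for illustration), is to join against a small lookup table and sort on the rank:

from pyspark.sql.functions import asc, desc

# hypothetical lookup: higher number = more senior designation
ranks = spark.createDataFrame(
    [("CEO", 100), ("projectManager", 50)],
    ["designation", "designation_rank"],
)

sorted_emp = (employees.join(ranks, "designation", "left")
              .sort(asc("year_of_joining"), desc("designation_rank")))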

Map a text file to key/value pair in order to group them in pyspark

I would like to create a Spark dataframe in PySpark from a text file that has a different number of rows and columns, and map it to a key/value pair, where the key is the first 4 characters from the first column of the text file. I want to do that in order to remove the redundant rows and to be able to group them later by the key value. I know how to do that in pandas but am still confused about where to start in PySpark.
My input is a text file that has the following:
1234567,micheal,male,usa
891011,sara,femal,germany
I want to be able to group every row by the first six characters in the first column.
Create a new column that contains only the first six characters of the first column, and then group by that:
from pyspark.sql.functions import col
df2 = df.withColumn("key", col("first_col").substr(1, 6))
df2.groupBy("key").agg(...)
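If the sticking point is getting from the raw text file to a DataFrame in the first place, a minimal PySpark sketch (the file path and column names are assumptions) could look like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# the file is comma-separated with no header row, so name the columns explicitly
df = spark.read.csv("people.txt").toDF("id", "name", "gender", "country")

# key on the first six characters of the first column, then group by it
df2 = df.withColumn("key", col("id").substr(1, 6))
df2.groupBy("key").count().show()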

Create sample value for failure records spark

I have a scenario where my dataframe has 3 columns: a, b and c. I need to validate whether the length of all the columns is equal to 100. Based on the validation I am creating status columns like a_status, b_status, c_status with values 5 (Success) and 10 (Failure). In failure scenarios I need to update the count and create new columns a_sample, b_sample, c_sample with some 5 failure sample values separated by ",". For creating the sample columns I tried this:
df = df.select(
  df.columns.toList.map(col(_)) :::
  df.columns.toList.map(x =>
    lit(getSample(df.select(x, x + "_status").filter(x + "_status=10").select(x).take(5))).alias(x + "_sample")
  ): _*
)
The getSample method just takes an array of rows and concatenates them as a string. This works fine for a limited number of columns and data size. However, if the number of columns is > 200 and the data is > 1 million rows, it creates a huge performance impact. Is there any alternative approach for this?
While the details of your problem statement are unclear, you can break up the task into two parts:
Transform data into a format where you identify several different types of rows you need to sample.
Collect sample by row type.
The industry jargon for "row type" is stratum/strata, and the way to do (2) without collecting data to the driver (which you don't want to do when the data is large) is via stratified sampling, which Spark implements via df.stat.sampleBy(). As a statistical function, it doesn't work with exact row counts but with fractions. If you absolutely must get a sample with an exact number of rows, there are two strategies:
Oversample by fraction and then filter out the unneeded rows, e.g., using the row_number() window function followed by a filter like row_num <= n (see the sketch after this list).
Build a custom user-defined aggregate function (UDAF), firstN(col, n). This will be much faster but a lot more work. See https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
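A rough PySpark sketch of the first strategy (oversampling, then trimming with row_number(); the DataFrame df, the a_status stratum column, the fraction and the seed are assumptions):

from pyspark.sql import Window
from pyspark.sql.functions import col, rand, row_number

n = 5  # number of sample rows wanted per stratum

# oversample the failure stratum (status value 10) by a generous fraction...
sampled = df.stat.sampleBy("a_status", fractions={10: 0.01}, seed=42)

# ...then keep at most n rows per stratum
w = Window.partitionBy("a_status").orderBy(rand())
trimmed = (sampled.withColumn("row_num", row_number().over(w))
           .filter(col("row_num") <= n)
           .drop("row_num"))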
An additional challenge for your use case is that you want this done per column. This is not a good fit with Spark's transformations such as grouping or sampleBy, which operate on rows. The simple approach is to make several passes through the data, one column at a time. If you absolutely must do this in a single pass through the data, you'll need to build a much more custom UDAF or Aggregator, e.g., the equivalent of takeFirstNFromAWhereBHasValueC(n, colA, colB, c).