Iterate over Dataframe & Recursive filters - pyspark

I have two dataframes, "MinNRule" and "SampleData".
MinNRule provides rule information based on which SampleData needs to be processed:
1. Aggregate SampleData on the columns defined in MinNRule.MinimumNPopulation, following MinNRule.OrderOfOperation
2. Check if Aggregate.Entity >= MinNRule.MinimumNValue
   a. For all entities that do not meet MinNRule.MinimumNValue, remove them from the population
   b. For all entities that meet MinNRule.MinimumNValue, keep them in the population
3. Perform 1 through 2 for the next MinNRule.OrderOfOperation using the dataset from 2.b
MinNRule
| MinimumNGroupName | MinimumNPopulation | MinimumNValue | OrderOfOperation |
|:-----------------:|:------------------:|:-------------:|:----------------:|
| Group1 | People by Facility | 6 | 1 |
| Group1 | People by Project | 4 | 2 |
SampleData
| Facility | Project | PeopleID |
|:--------: |:-------: |:--------: |
| F1 | P1 | 166152 |
| F1 | P1 | 425906 |
| F1 | P1 | 332127 |
| F1 | P1 | 241630 |
| F1 | P2 | 373865 |
| F1 | P2 | 120672 |
| F1 | P2 | 369407 |
| F2 | P4 | 121705 |
| F2 | P4 | 211807 |
| F2 | P4 | 408041 |
| F2 | P4 | 415579 |
Proposed Steps:
1. Read MinNRule, take the rule with OrderOfOperation=1
   a. GroupBy Facility, Count on People
   b. Aggregate SampleData as per 1.a and compare to MinimumNValue=6:
| Facility | Count | MinNPass |
|:--------: |:-------: |:--------: |
| F1 | 7 | Y |
| F2 | 4 | N |
2. Select MinNPass='Y' rows and filter the initial dataframe down to those entities (F2 gets dropped):
| Facility | Project | PeopleID |
|:--------: |:-------: |:--------: |
| F1 | P1 | 166152 |
| F1 | P1 | 425906 |
| F1 | P1 | 332127 |
| F1 | P1 | 241630 |
| F1 | P2 | 373865 |
| F1 | P2 | 120672 |
| F1 | P2 | 369407 |
3. Read MinNRule, take the rule with OrderOfOperation=2
   a. GroupBy Project, Count on People
   b. Aggregate the dataframe from step 2 as per 3.a and compare to MinimumNValue=4:
| Project | Count | MinNPass |
|:--------: |:-------: |:--------: |
| P1 | 4 | Y |
| P2 | 3 | N |
4. Select MinNPass='Y' rows and filter the dataframe from step 3 down to those entities (P2 gets dropped)
5. Print the final result:
| Facility | Project | PeopleID |
|:--------: |:-------: |:--------: |
| F1 | P1 | 166152 |
| F1 | P1 | 425906 |
| F1 | P1 | 332127 |
| F1 | P1 | 241630 |
Ideas:
I have been thinking of moving MinNRule to a local iterator and looping through it, "filtering" SampleData at each step.
I am not sure how to pass the result at the end of one loop iteration over to the next.
Still learning PySpark, so I am unsure if this is the correct approach.
I am using Azure Databricks.

IIUC, since the rules dataframe only defines the rules, it should be small and can be collected to the driver to drive the operations on the main data.
One approach to get the desired result is to collect the rules dataframe, ordered by OrderOfOperation, and fold it over the main dataframe with reduce:
from functools import reduce
from pyspark.sql.functions import col, count

# Collect the (small) rules dataframe to the driver, ordered by OrderOfOperation
data = MinNRule.orderBy('OrderOfOperation').collect()

# Fold the rules over SampleData: each step keeps only the groups that meet the minimum N
dfnew = reduce(lambda df, rules: df.groupBy(col(rules.MinimumNPopulation.split('by')[1].strip()))
               .agg(count(col({'People': 'PeopleID'}.get(rules.MinimumNPopulation.split('by')[0].strip()))).alias('count'))
               .filter(col('count') >= rules.MinimumNValue).drop('count')
               .join(df, rules.MinimumNPopulation.split('by')[1].strip(), 'inner'),
               data, sampleData)
dfnew.show()
+-------+--------+--------+
|Project|Facility|PeopleID|
+-------+--------+--------+
| P1| F1| 166152|
| P1| F1| 425906|
| P1| F1| 332127|
| P1| F1| 241630|
+-------+--------+--------+
Alternatively, you can loop through the collected rules and get the same result; the performance remains the same in both cases:
import pyspark.sql.functions as f

# Map the "count" part of the rule text to an actual column in SampleData
mapped_cols = {'People': 'PeopleID'}

data = MinNRule.orderBy('OrderOfOperation').collect()
for i in data:
    cnt, grp = i.MinimumNPopulation.split('by')
    cnt = mapped_cols.get(cnt.strip())
    grp = grp.strip()
    # Keep only the groups whose count meets the minimum, then join back to the detail rows
    sampleData = sampleData.groupBy(f.col(grp)).agg(f.count(f.col(cnt)).alias('count')).\
        filter(f.col('count') >= i.MinimumNValue).drop('count').join(sampleData, grp, 'inner')

sampleData.show()
+-------+--------+--------+
|Project|Facility|PeopleID|
+-------+--------+--------+
| P1| F1| 166152|
| P1| F1| 425906|
| P1| F1| 332127|
| P1| F1| 241630|
+-------+--------+--------+
Note: you have to parse your rules grammar manually, as it is subject to change.
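For example, that string handling could live in one small helper so there is a single place to update when the grammar changes (a sketch only; parse_rule and MAPPED_COLS are illustrative names, not part of the snippets above):

MAPPED_COLS = {'People': 'PeopleID'}

def parse_rule(population):
    """Turn a rule such as 'People by Facility' into (count_col, group_col)."""
    # split on ' by ' (with spaces) so a 'by' inside a word is not treated as the separator
    left, right = [s.strip() for s in population.split(' by ', 1)]
    return MAPPED_COLS.get(left, left), right

# parse_rule('People by Facility')  ->  ('PeopleID', 'Facility')
# parse_rule('People by Project')   ->  ('PeopleID', 'Project')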

Related

How to replace null values in a dataframe based on values in other dataframe?

Here's a dataframe, df1, I have
+---------+-------+---------+
| C1 | C2 | C3 |
+---------+-------+---------+
| xr | 1 | ixfg |
| we | 5 | jsfd |
| r5 | 7 | hfga |
| by | 8 | srjs |
| v4 | 4 | qwks |
| c0 | 0 | khfd |
| ba | 2 | gdbu |
| mi | 1 | pdlo |
| lp | 7 | ztpq |
+---------+-------+---------+
Here's another, df2, that I have
+----------+-------+---------+
| V1 | V2 | V3 |
+----------+-------+---------+
| Null | 6 | ixfg |
| Null | 2 | jsfd |
| Null | 2 | hfga |
| Null | 7 | qwks |
| Null | 1 | khfd |
| Null | 9 | gdbu |
+----------+-------+---------+
What I would like to have is another dataframe that:
1. Ignores values in V2 and takes values in C2 wherever V3 and C3 match, and
2. Replaces V1 with values in C1 wherever V3 and C3 match.
The result should look like the following:
+----------+-------+---------+
| M1 | M2 | M3 |
+----------+-------+---------+
| xr | 1 | ixfg |
| we | 5 | jsfd |
| r5 | 7 | hfga |
| v4 | 4 | qwks |
| c0 | 0 | khfd |
| ba | 2 | gdbu |
+----------+-------+---------+
You can join and use coalesce to take the value with the higher priority.
Note: coalesce takes any number of columns (ordered from highest priority to lowest in the argument list) and returns the first non-null value, so if you do want to replace with null when there is a null in the lower-priority column, you cannot use this function.
from pyspark.sql import functions as F

df = (df1.join(df2, on=(df1.C3 == df2.V3))
         .select(F.coalesce(df1.C1, df2.V1).alias('M1'),
                 F.coalesce(df1.C2, df2.V2).alias('M2'),   # C2 takes priority over V2, per the expected output
                 F.col('C3').alias('M3')))
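For a self-contained check, the two inputs can be rebuilt from the sample tables above (a sketch; note that df2 needs an explicit schema because V1 is entirely null, so its type cannot be inferred):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [("xr", 1, "ixfg"), ("we", 5, "jsfd"), ("r5", 7, "hfga"),
     ("by", 8, "srjs"), ("v4", 4, "qwks"), ("c0", 0, "khfd"),
     ("ba", 2, "gdbu"), ("mi", 1, "pdlo"), ("lp", 7, "ztpq")],
    ["C1", "C2", "C3"])

# V1 is all null, so spell out the schema explicitly
schema2 = StructType([StructField("V1", StringType(), True),
                      StructField("V2", IntegerType(), True),
                      StructField("V3", StringType(), True)])
df2 = spark.createDataFrame(
    [(None, 6, "ixfg"), (None, 2, "jsfd"), (None, 2, "hfga"),
     (None, 7, "qwks"), (None, 1, "khfd"), (None, 9, "gdbu")],
    schema2)

Running the join above then yields the six matched rows shown in the expected output.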

Compare consecutive rows and extract words (excluding the subsets) in Spark

I am working with a Spark dataframe. The input dataframe looks like Table 1 below. I need to write logic to get the keywords with maximum length for each session id; there are multiple keywords that should be part of the output for each session id. The expected output looks like Table 2.
Input dataframe:
(Table 1)
|-----------+------------+-----------------------------------|
| session_id| value | Timestamp |
|-----------+------------+-----------------------------------|
| 1 | cat | 2021-01-11T13:48:54.2514887-05:00 |
| 1 | catc | 2021-01-11T13:48:54.3514887-05:00 |
| 1 | catch | 2021-01-11T13:48:54.4514887-05:00 |
| 1 | par | 2021-01-11T13:48:55.2514887-05:00 |
| 1 | part | 2021-01-11T13:48:56.5514887-05:00 |
| 1 | party | 2021-01-11T13:48:57.7514887-05:00 |
| 1 | partyy | 2021-01-11T13:48:58.7514887-05:00 |
| 2 | fal | 2021-01-11T13:49:54.2514887-05:00 |
| 2 | fall | 2021-01-11T13:49:54.3514887-05:00 |
| 2 | falle | 2021-01-11T13:49:54.4514887-05:00 |
| 2 | fallen | 2021-01-11T13:49:54.8514887-05:00 |
| 2 | Tem | 2021-01-11T13:49:56.5514887-05:00 |
| 2 | Temp | 2021-01-11T13:49:56.7514887-05:00 |
|-----------+------------+-----------------------------------|
Expected Output:
(Table 2)
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 1 | partyy |
| 2 | fallen |
| 2 | Temp |
|-----------+------------|
Solution I tried:
I added another column called col_length which captures the length of each word in the value column. Later I tried to compare each row with its subsequent row to see if it is of maximum length. But this solution only works partially.
val df = spark.read.parquet("/project/project_name/abc")
val dfM = df.select($"session_id", $"value", $"Timestamp").withColumn("col_length", length($"value"))
val ts = Window
  .orderBy("session_id")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)
val result = dfM
  .withColumn("running_max", max("col_length") over ts)
  .where($"running_max" === $"col_length")
  .select("session_id", "value", "Timestamp")
Current Output:
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 2 | fallen |
|-----------+------------|
Multiple columns do not work inside an orderBy clause with a window function, so I didn't get the desired output; I got one row per session id. Any suggestions would be highly appreciated. Thanks in advance.
You can solve it by using the lead function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Look at the next word typed within the same session, in time order
val windowSpec = Window.partitionBy("session_id").orderBy("Timestamp")
dfM
  .withColumn("lead", lead("value", 1).over(windowSpec))
  .filter((length(col("lead")) < length(col("value"))) || col("lead").isNull)
  .drop("lead")
  .show
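For reference, since the main question on this page is PySpark, a rough equivalent of the same idea in PySpark (a sketch, assuming the input dataframe is called dfM with the columns shown in Table 1) would be:

import pyspark.sql.functions as f
from pyspark.sql.window import Window

w = Window.partitionBy("session_id").orderBy("Timestamp")
result = (dfM
          .withColumn("lead", f.lead("value", 1).over(w))
          .filter((f.length("lead") < f.length("value")) | f.col("lead").isNull())
          .drop("lead"))
result.show()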

Check if a value is between two columns, spark scala

I have two dataframes, one with my data and another one to compare against. What I want to do is check whether a value falls within the range defined by two different columns, for example:
Df_player
+--------+-------+
| Baller | Power |
+--------+-------+
| John | 1.5 |
| Bilbo | 3.7 |
| Frodo | 6 |
+--------+-------+
Df_Check
+--------+--------+--------+
| First | Second | Value |
+--------+--------+--------+
| 1 | 1.5 | Bad- |
| 1.5 | 3 | Bad |
| 3 | 4.2 | Good |
| 4.2 | 6 | Good+ |
+--------+--------+--------+
The result would be:
Df_out
+--------+-------+--------+
| Baller | Power | Value |
+--------+-------+--------+
| John | 1.5 | Bad- |
| Bilbo | 3.7 | Good |
| Frodo | 6 | Good+ |
+--------+-------+--------+
You can do a join based on a range condition; note that .between is not appropriate here because you need a strict inequality on one side of the comparison:
val result = df_player.join(
  df_check,
  df_player("Power") > df_check("First") && df_player("Power") <= df_check("Second"),
  "left"
).select("Baller", "Power", "Value")
result.show
+------+-----+-----+
|Baller|Power|Value|
+------+-----+-----+
| John| 1.5| Bad-|
| Bilbo| 3.7| Good|
| Frodo| 6.0|Good+|
+------+-----+-----+
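The same range join reads almost identically in PySpark (a sketch, assuming dataframes named df_player and df_check as above):

result = df_player.join(
    df_check,
    (df_player.Power > df_check.First) & (df_player.Power <= df_check.Second),
    "left",
).select("Baller", "Power", "Value")
result.show()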

PostgreSQL: Transforming rows into columns when more than three columns are needed

I have a table like the following one:
+---------+-------+-------+-------------+
| Section | Group | Level | Fulfillment |
+---------+-------+-------+-------------+
| A       | Y     | 1     | 82.2        |
| A       | Y     | 2     | 23.2        |
| A       | M     | 1     | 81.1        |
| A       | M     | 2     | 28.2        |
| B       | Y     | 1     | 89.1        |
| B       | Y     | 2     | 58.2        |
| B       | M     | 1     | 32.5        |
| B       | M     | 2     | 21.4        |
+---------+-------+-------+-------------+
And this would be my desired output:
+---------+-------+--------------------+--------------------+
| Section | Group | Level1_Fulfillment | Level2_Fulfillment |
+---------+-------+--------------------+--------------------+
| A | Y | 82.2 | 23.2 |
| A | M | 81.1 | 28.2 |
| B | Y | 89.1 | 58.2 |
| B | M | 32.5 | 21.4 |
+---------+-------+--------------------+--------------------+
Thus, for each section and group I'd like to obtain their percentages of fulfillment for level 1 and level 2. To achieve this, I've tried crosstab(), but using this function returns an error ("The provided SQL must return 3 columns: rowid, category, and values.") because I'm using more than three columns (I need to keep section and group as identifiers for each row). Is it possible to use crosstab in this case?
Regards.
I find crosstab() unnecessarily complicated to use and prefer conditional aggregation:
select section,
       "group",
       max(fulfillment) filter (where level = 1) as level_1,
       max(fulfillment) filter (where level = 2) as level_2
from the_table
group by section, "group"
order by section;
Online example
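For readers following the Spark questions on this page, the equivalent conditional aggregation in PySpark would be along these lines (a sketch, assuming the table has been loaded into a dataframe df with the columns shown above):

import pyspark.sql.functions as F

result = (df.groupBy("Section", "Group")
            .agg(F.max(F.when(F.col("Level") == 1, F.col("Fulfillment"))).alias("Level1_Fulfillment"),
                 F.max(F.when(F.col("Level") == 2, F.col("Fulfillment"))).alias("Level2_Fulfillment")))
result.show()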

Group by certain record in array (pyspark)

I want to group the data in such a way that, for a particular record, each of the array values is also used to group that record.
I am able to group by name only; I am not able to figure out the way to do this.
I have tried the following query:
import pyspark.sql.functions as f
df.groupBy('name').agg(f.collect_list('data').alias('data_new')).show()
Following is the dataframe:
|-------|--------------------|
| name | data |
|-------|--------------------|
| a | [a,b,c,d,e,f,g,h,i]|
| b | [b,c,d,e,j,k] |
| c | [c,f,l,m] |
| d | [k,b,d] |
| n | [n,o,p,q] |
| p | [p,r,s,t] |
| u | [u,v,w,x] |
| b | [b,f,e,g] |
| c | [c,b,g,h] |
| a | [a,l,f,m] |
|----------------------------|
I am expecting the following output:
|-------|----------------------------|
| name | data |
|-------|----------------------------|
| a | [a,b,c,d,e,f,g,h,i,j,k,l,m]|
| n | [n,o,p,q,r,s,t] |
| u | [u,v,w,x] |
|-------|----------------------------|
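One way to get this kind of transitive grouping is to treat it as a connected-components problem: link each name to every value in its array, so rows whose arrays overlap (directly or through other rows) end up in the same component. A rough sketch using the GraphFrames package (an assumption; it is a separate Spark package that has to be installed), with df being the dataframe shown above:

import pyspark.sql.functions as f
from graphframes import GraphFrame  # external package, assumed to be available

spark.sparkContext.setCheckpointDir('/tmp/cc_checkpoint')  # required by connectedComponents

# Vertices: every name and every array element; edges: name -> each of its array elements
verts = (df.select(f.col('name').alias('id'))
           .union(df.select(f.explode('data').alias('id')))
           .distinct())
edges = df.select(f.col('name').alias('src'), f.explode('data').alias('dst'))

cc = GraphFrame(verts, edges).connectedComponents()  # columns: id, component

# Join the component id back to the rows, then merge names and array values per component
result = (df.join(cc, df.name == cc.id)
            .withColumn('item', f.explode('data'))
            .groupBy('component')
            .agg(f.min('name').alias('name'),
                 f.array_sort(f.collect_set('item')).alias('data'))
            .drop('component'))
result.show(truncate=False)

Up to row ordering, this should reproduce the expected output above, with the smallest name in each component used as the group label.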