Trim dataframe in Spark using last appearance of a value in a column - Scala

I have a dataframe that I want to trim at the last appearance of the value Good in the PDP column. That means keeping rows 1 through 5; anything after row 5 does not matter.
+------+----+
|custId| PDP|
+------+----+
|  1001| New|
|  1002|Good|
|  1003| New|
|  1004| New|
|  1005|Good|
|  1006| New|
|  1007| New|
|  1008| New|
|  1009| New|
+------+----+
What I need is this dataframe, since the last Good action happened on row 5:
+------+----+
|custId| PDP|
+------+----+
|  1001| New|
|  1002|Good|
|  1003| New|
|  1004| New|
|  1005|Good|
+------+----+
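For reference, a minimal sketch that reproduces this sample dataframe (assuming a SparkSession in scope as spark and integer custId values):
import spark.implicits._

val df = Seq(
  (1001, "New"), (1002, "Good"), (1003, "New"), (1004, "New"), (1005, "Good"),
  (1006, "New"), (1007, "New"), (1008, "New"), (1009, "New")
).toDF("custId", "PDP")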

You can try:
import org.apache.spark.sql.functions.max
import spark.implicits._  // assumes a SparkSession named spark; auto-imported in spark-shell

df
  .filter($"PDP" === "Good")             // Keep only the Good rows
  .select(max("custId").alias("maxId"))  // Find the last (max) custId with Good
  .crossJoin(df)
  .where($"custId" <= $"maxId")          // Keep records with custId <= maxId
  .drop("maxId")                         // Remove the helper column

You have to find the last row index with Good in the PDP column, and then keep only the rows up to and including that index.
custId
If your custId column contains increasing ids in sorted order, then you can do the following:
import org.apache.spark.sql.functions._
val maxIdToFilter = df.filter(lower(col("PDP")) === "good").select(max(col("custId").cast("long"))).first().getLong(0)
df.filter(col("custId") <= maxIdToFilter).show(false)
monotonically_increasing_id
If your custId is not sorted in increasing order, then you can use the following logic:
import org.apache.spark.sql.functions._
val dfWithRow = df.withColumn("rowNo", monotonically_increasing_id())
val maxIdToFilter = dfWithRow.filter(lower(col("PDP")) === "good").select(max("rowNo")).first().getLong(0)
dfWithRow.filter(col("rowNo") <= maxIdToFilter).drop("rowNo").show(false)
I hope the answer is helpful

Related

Transform distinct row values to different columns with corresponding rows using PySpark

I'm new to Pyspark and trying to transform data
Given dataframe
Col1
A=id1a A=id2a B=id1b C=id1c B=id2b
D=id1d A=id3a B=id3b C=id2c
A=id4a C=id3c
Required:
A B C
id1a id1b id1c
id2a id2b id2c
id3a id3b id3c
id4a null null
I have tried pivot, but that gives only the first value.
There might be a better way, but one approach is to split the column on spaces to create an array of entries, then use higher-order functions (Spark 2.4+) to split each entry of that array on '='. Then explode and create two columns, one with the key and one with the value. Finally, assign a row number within each key, group by that row number, and pivot:
import pyspark.sql.functions as F
from pyspark.sql import Window

df1 = (df.withColumn("Col1", F.split(F.col("Col1"), "\s+"))
         .withColumn("Col1", F.explode(F.expr("transform(Col1, x -> split(x, '='))")))
         .select(F.col("Col1")[0].alias("cols"), F.col("Col1")[1].alias("vals")))

w = Window.partitionBy("cols").orderBy("cols")
final = (df1.withColumn("Rnum", F.row_number().over(w))
            .groupBy("Rnum")
            .pivot("cols")
            .agg(F.first("vals"))
            .orderBy("Rnum"))
final.show()
+----+----+----+----+----+
|Rnum| A| B| C| D|
+----+----+----+----+----+
| 1|id1a|id1b|id1c|id1d|
| 2|id2a|id2b|id2c|null|
| 3|id3a|id3b|id3c|null|
| 4|id4a|null|null|null|
+----+----+----+----+----+
This is how df1 looks after the transformation:
df1.show()
+----+----+
|cols|vals|
+----+----+
| A|id1a|
| A|id2a|
| B|id1b|
| C|id1c|
| B|id2b|
| D|id1d|
| A|id3a|
| B|id3b|
| C|id2c|
| A|id4a|
| C|id3c|
+----+----+
Maybe I don't know the full picture, but the data format seems strange. If nothing can be done at the data source, then some collects, pivots and joins will be needed. Try this:
import pyspark.sql.functions as F
from functools import reduce
test = sqlContext.createDataFrame([('A=id1a A=id2a B=id1b C=id1c B=id2b',1),('D=id1d A=id3a B=id3b C=id2c',2),('A=id4a C=id3c',3)],schema=['col1','id'])
tst_spl = test.withColumn("item",(F.split('col1'," ")))
tst_xpl = tst_spl.select(F.explode("item"))
tst_map = tst_xpl.withColumn("key",F.split('col','=')[0]).withColumn("value",F.split('col','=')[1]).drop('col')
#%%
tst_pivot = tst_map.groupby(F.lit(1)).pivot('key').agg(F.collect_list(('value'))).drop('1')
#%%
tst_arr = [tst_pivot.select(F.posexplode(coln)).withColumnRenamed('col',coln) for coln in tst_pivot.columns]
tst_fin = reduce(lambda df1,df2:df1.join(df2,on='pos',how='full'),tst_arr).orderBy('pos')
tst_fin.show()
+---+----+----+----+----+
|pos| A| B| C| D|
+---+----+----+----+----+
| 0|id3a|id3b|id1c|id1d|
| 1|id4a|id1b|id2c|null|
| 2|id1a|id2b|id3c|null|
| 3|id2a|null|null|null|
+---+----+----+----+----+

How to check whether a whole column in PySpark contains a value using expr

In PySpark, how can I use expr to check whether a whole column contains the value in columnA of that row?
Pseudo code below:
df = df.withColumn("Result", expr(if any of the rows in column1 contains the value colA (for this row) then 1 else 0))
Take an arbitrary example:
valuesCol = [('rose','rose is red'),('jasmine','I never saw Jasmine'),('lily','Lili dont be silly'),('daffodil','what a flower')]
df = sqlContext.createDataFrame(valuesCol,['columnA','columnB'])
df.show()
+--------+-------------------+
| columnA| columnB|
+--------+-------------------+
| rose| rose is red|
| jasmine|I never saw Jasmine|
| lily| Lili dont be silly|
|daffodil| what a flower|
+--------+-------------------+
This is an application of expr(). To see how to use expr(), just look up the corresponding SQL syntax; it will mostly work as-is inside expr().
df = df.withColumn('columnA_exists',expr("(case when instr(lower(columnB), lower(columnA))>=1 then 1 else 0 end)"))
df.show()
+--------+-------------------+--------------+
| columnA| columnB|columnA_exists|
+--------+-------------------+--------------+
| rose| rose is red| 1|
| jasmine|I never saw Jasmine| 1|
| lily| Lili dont be silly| 0|
|daffodil| what a flower| 0|
+--------+-------------------+--------------+

Apache Spark aggregation: aggregate a column based on another column's value

I am not sure if I am asking this correctly, and maybe that is the reason why I didn't find the correct answer so far. Anyway, if it turns out to be a duplicate I will delete this question.
I have the following data:
id | last_updated | count
__________________________
1 | 20190101 | 3
1 | 20190201 | 2
1 | 20190301 | 1
I want to group this data by the "id" column, get the max value from "last_updated", and for the "count" column keep the value from the row where "last_updated" has its max value. So in that case the result should look like this:
id | last_updated | count
__________________________
1 | 20190301 | 1
So I imagine it will look something like this:
df
.groupBy("id")
.agg(max("last_updated"), ... ("count"))
Is there any function I can use to get "count" based on the "last_updated" column?
I am using Spark 2.4.0.
Thanks for any help.
You have two options; the first is the better one, in my understanding.
OPTION 1
Use a window function partitioned by the ID to create a column with the max value over that window. Then keep the rows where the desired column equals that max value, and finally drop the original column and rename the max column as desired.
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
OPTION 2
You can perform a join with the original dataframe after grouping
df.groupBy("id")
.agg(max("last_updated").as("last_updated"))
.join(df, Seq("id", "last_updated"))
QUICK EXAMPLE
INPUT
df.show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
| 1| 20190101| 3|
| 1| 20190201| 2|
| 1| 20190301| 1|
+---+------------+-----+
OUTPUT
Option 1
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
+---+-----+------------+
| id|count|last_updated|
+---+-----+------------+
| 1| 1| 20190301|
+---+-----+------------+
Option 2
df.groupBy("id")
.agg(max("last_updated").as("last_updated")
.join(df, Seq("id", "last_updated")).show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
|  1|    20190301|    1|
+---+------------+-----+
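As a sketch of a third option (not part of the answer above, just a common idiom that works on Spark 2.4): take the max of a struct whose first field is last_updated. Struct ordering compares fields left to right, so the max struct carries along the count from the row with the latest last_updated.
import org.apache.spark.sql.functions._

df.groupBy("id")
  .agg(max(struct("last_updated", "count")).as("latest"))  // max by last_updated, count as tie-breaker
  .select(col("id"), col("latest.last_updated"), col("latest.count"))
  .show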

Filter rows based on a timestamp in another column, Spark Scala

Suppose I have the following data frame in Spark Scala:
+--------+--------------------+--------------------+
|Index | Date| Date_x|
+--------+--------------------+--------------------+
| 1|2018-01-31T20:33:...|2018-01-31T21:18:...|
| 1|2018-01-31T20:35:...|2018-01-31T21:18:...|
| 1|2018-01-31T21:04:...|2018-01-31T21:18:...|
| 1|2018-01-31T21:05:...|2018-01-31T21:18:...|
| 1|2018-01-31T21:15:...|2018-01-31T21:18:...|
| 1|2018-01-31T21:16:...|2018-01-31T21:18:...|
| 1|2018-01-31T21:19:...|2018-01-31T21:18:...|
| 1|2018-01-31T21:20:...|2018-01-31T21:18:...|
| 2|2018-01-31T19:43:...|2018-01-31T20:35:...|
| 2|2018-01-31T19:44:...|2018-01-31T20:35:...|
| 2|2018-01-31T20:36:...|2018-01-31T20:35:...|
+--------+--------------------+--------------------+
I want to remove the rows where Date < Date_x for each Index, as illustrated below:
+--------+--------------------+--------------------+
|Index | Date| Date_x|
+--------+--------------------+--------------------+
| 1|2018-01-31T21:19:...|2018-01-31T21:18:...|
| 1|2018-01-31T21:20:...|2018-01-31T21:18:...|
| 2|2018-01-31T20:36:...|2018-01-31T20:35:...|
+--------+--------------------+--------------------+
I tried adding a column x_idx by using monotonically_increasing_id() and getting min(x_idx) for each Index where Date < Date_x, so that I can subsequently drop the rows from the data frame that don't satisfy the condition. But it doesn't seem to work for me. I probably don't fully understand how agg() works. Thank you for your help!
val test_df = df.withColumn("x_idx", monotonically_increasing_id())
val newIdx = test_df
.filter($"Date" > "Date_x")
.groupBy($"Index")
.agg(min($"x_idx"))
.toDF("n_Index", "min_x_idx")
newIdx.show
+-------+---------+
|n_Index|min_x_idx|
+-------+---------+
+-------+---------+
You forgot to add $ in
.filter($"Date" > "Date_x")
so the correct filter is
.filter($"Date" > $"Date_x")
You can use alias instead of calling toDF as
val newIdx = test_df
.filter($"Date" > $"Date_x")
.groupBy($"Index".as("n_Index"))
.agg(min($"x_idx").as("min_x_idx"))
You should be getting output as
+-------+---------+
|n_Index|min_x_idx|
+-------+---------+
|1 |6 |
|2 |10 |
+-------+---------+
The filter condition might be filtering out all the records. Print the dataframe after filtering and make sure the filter works as you expect.
val newIdx = test_df
.filter($"Date" > $"Date_x")
.show
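To complete the approach described in the question (dropping, per Index, the rows before the first one that satisfies the condition), a sketch that joins the corrected newIdx (columns n_Index and min_x_idx as above) back to test_df and filters on the row index; this assumes the data is already ordered by Date within each Index, which is what monotonically_increasing_id relies on here:
val result = test_df
  .join(newIdx, test_df("Index") === newIdx("n_Index"))
  .where($"x_idx" >= $"min_x_idx")        // keep rows at or after the first qualifying one
  .drop("n_Index", "min_x_idx", "x_idx")  // remove the helper columns

result.show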

How to convert rows into columns in a Spark dataframe, Scala

Is there any way to transpose dataframe rows into columns?
I have the following structure as input:
val inputDF = Seq(
  ("pid1", "enc1", "bat"),
  ("pid1", "enc2", ""),
  ("pid1", "enc3", ""),
  ("pid3", "enc1", "cat"),
  ("pid3", "enc2", "")
).toDF("MemberID", "EncounterID", "entry")
inputDF.show:
+--------+-----------+-----+
|MemberID|EncounterID|entry|
+--------+-----------+-----+
| pid1| enc1| bat|
| pid1| enc2| |
| pid1| enc3| |
| pid3| enc1| cat|
| pid3| enc2| |
+--------+-----------+-----+
expected result:
+--------+----------+----------+----------+-----+
|MemberID|Encounter1|Encounter2|Encounter3|entry|
+--------+----------+----------+----------+-----+
| pid1| enc1| enc2| enc3| bat|
| pid3| enc1| enc2| null| cat|
+--------+----------+----------+----------+-----+
Please suggest if there is any optimized direct API available for transposing rows into columns.
My input data size is quite huge, so I won't be able to perform actions like collect, as that would pull all the data onto the driver.
I am using Spark 2.x.
I am not sure that what you need is what you actually asked. Still, just in case, here is an idea:
import org.apache.spark.sql.functions._
import spark.implicits._  // for the 'entry symbol syntax

val entries = inputDF
  .where('entry isNotNull)
  .where('entry =!= "")   // =!= rather than the deprecated !==
  .select("MemberID", "entry").distinct

val df = inputDF.groupBy("MemberID")
  .agg(collect_list("EncounterID") as "encounterList")
  .join(entries, Seq("MemberID"))
df.show
+--------+------------------+-----+
|MemberID|     encounterList|entry|
+--------+------------------+-----+
|    pid1|[enc2, enc1, enc3]|  bat|
|    pid3|      [enc2, enc1]|  cat|
+--------+------------------+-----+
The order of the list is not deterministic but you may sort it and then extract new columns from it with .withColumn("Encounter1", sort_array($"encounterList")(0))...
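A sketch of that extraction step (using the imports above; it assumes at most three encounters per MemberID, and an out-of-range array index yields null):
val wide = df
  .withColumn("sortedList", sort_array($"encounterList"))
  .withColumn("Encounter1", $"sortedList"(0))
  .withColumn("Encounter2", $"sortedList"(1))  // null when the list is shorter
  .withColumn("Encounter3", $"sortedList"(2))
  .select("MemberID", "Encounter1", "Encounter2", "Encounter3", "entry")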
Other idea
In case what you want is to put the value of entry in the corresponding "Encounter" column, you can use a pivot:
inputDF
.groupBy("MemberID")
.pivot("EncounterID", Seq("enc1", "enc2", "enc3"))
.agg(first("entry")).show
+--------+----+----+----+
|MemberID|enc1|enc2|enc3|
+--------+----+----+----+
| pid1| bat| | |
| pid3| cat| | |
+--------+----+----+----+
Passing Seq("enc1", "enc2", "enc3") is optional, but since you know the content of the column, it speeds up the computation.