pyspark: duplicate row with column value from another row


My input df:
+-------------------+------+
|        windowStart|nodeId|
+-------------------+------+
|2022-03-11 14:00:00|     1|
|2022-03-11 15:00:00|     2|
|2022-03-11 16:00:00|     3|
+-------------------+------+
I would like to duplicate each row and give the copy the windowStart value of the following row, so the output should look like this:
+-------------------+------+
|        windowStart|nodeId|
+-------------------+------+
|2022-03-11 14:00:00|     1|
|2022-03-11 15:00:00|     1|
|2022-03-11 15:00:00|     2|
|2022-03-11 16:00:00|     2|
|2022-03-11 16:00:00|     3|
+-------------------+------+
How can I achieve that? Thanks!

from pyspark.sql import Window as W
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [
        ('2022-03-11 14:00:00', '1'),
        ('2022-03-11 15:00:00', '2'),
        ('2022-03-11 16:00:00', '3'),
    ],
    ['windowStart', 'nodeId'],
)

# Order all rows by windowStart (no partitionBy, so the window spans the whole DataFrame).
w = W.orderBy('windowStart')

# For each row, take the windowStart of the next row; the last row has none and is dropped.
df_lag = (
    df.withColumn('lag', F.lead(F.col('windowStart'), 1).over(w))
    .select(F.col('lag').alias('windowStart'), 'nodeId')
    .filter(F.col('windowStart').isNotNull())
)

# Append the shifted copies to the original rows.
df.union(df_lag).orderBy('windowStart', 'nodeId').show()
+-------------------+------+
| windowStart|nodeId|
+-------------------+------+
|2022-03-11 14:00:00| 1|
|2022-03-11 15:00:00| 1|
|2022-03-11 15:00:00| 2|
|2022-03-11 16:00:00| 2|
|2022-03-11 16:00:00| 3|
+-------------------+------+
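
The lead-and-union logic above can also be traced without a Spark session. What follows is a minimal plain-Python sketch, assuming the rows are already sorted by windowStart; the function name duplicate_with_next_start is hypothetical and only models what F.lead and union do in the answer.

```python
def duplicate_with_next_start(rows):
    """rows: list of (windowStart, nodeId) tuples, sorted by windowStart."""
    # Model of F.lead(..., 1): pair each row with the windowStart of the next row.
    # The last row has no successor, so zip simply leaves it out.
    shifted = [
        (next_start, node_id)
        for (_, node_id), (next_start, _) in zip(rows, rows[1:])
    ]
    # Model of df.union(df_lag).orderBy('windowStart', 'nodeId').
    return sorted(rows + shifted)


rows = [
    ("2022-03-11 14:00:00", "1"),
    ("2022-03-11 15:00:00", "2"),
    ("2022-03-11 16:00:00", "3"),
]
result = duplicate_with_next_start(rows)
for window_start, node_id in result:
    print(window_start, node_id)
```

Each row except the last appears twice in result, once with its own windowStart and once with the next row's.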

Related

Unsure how to apply row-wise normalization on pyspark dataframe

Disclaimer: I'm a beginner when it comes to Pyspark.
For each cell in a row, I'd like to apply the following function
new_col_i = col_i / max(col_1,col_2,col_3,...,col_n)
At the very end, I'd like the range of values to go from 0.0 to 1.0.
Here are the details of my dataframe:
Dimensions: (6.5M, 2905)
Dtypes: Double
Initial DF:
+-----+-------+-------+-------+
|   id|  col_1|  col_2| col_n |
+-----+-------+-------+-------+
| 1| 7.5| 0.1| 2.0|
| 2| 0.3| 3.5| 10.5|
+-----+-------+-------+-------+
Updated DF:
+-----+-------+-------+-------+
|   id|  col_1|  col_2| col_n |
+-----+-------+-------+-------+
| 1| 1.0| 0.013| 0.26|
| 2| 0.028| 0.33| 1.0|
+-----+-------+-------+-------+
Any help would be appreciated.

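You can compute each row's maximum with array_max over an array of the value columns, then loop over those columns and divide each one by that maximum.
from pyspark.sql.functions import array, array_max, col

cols = df.columns[1:]  # every column except id
# Row-wise maximum across all value columns.
df2 = df.withColumn('max', array_max(array(*[col(c) for c in cols])))
# Divide each value column by the row maximum.
for c in cols:
    df2 = df2.withColumn(c, col(c) / col('max'))
df2.show()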
+---+-------------------+--------------------+-------------------+----+
| id| col_1| col_2| col_n| max|
+---+-------------------+--------------------+-------------------+----+
| 1| 1.0|0.013333333333333334|0.26666666666666666| 7.5|
| 2|0.02857142857142857| 0.3333333333333333| 1.0|10.5|
+---+-------------------+--------------------+-------------------+----+
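
The arithmetic behind the table above can be checked in plain Python. This is only a sketch of the per-row formula new_col_i = col_i / max(col_1, ..., col_n), assuming every row has a strictly positive maximum (a zero maximum would raise ZeroDivisionError, and negative values would fall outside the 0.0 to 1.0 range); normalize_row is a hypothetical helper name.

```python
def normalize_row(values):
    """Divide every value in a row by that row's maximum."""
    row_max = max(values)
    return [v / row_max for v in values]


# Same sample data as the question, keyed by id.
rows = {1: [7.5, 0.1, 2.0], 2: [0.3, 3.5, 10.5]}
normalized = {row_id: normalize_row(vals) for row_id, vals in rows.items()}

for row_id, vals in normalized.items():
    print(row_id, vals)
```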

Hi, could you please help me resolve an issue while creating a new column in PySpark? The issue is explained below:

I want to replace the values in the existing columns based on a condition: if the value of another column (col_ff) is "abc", the column keeps its value; otherwise it should become null or blank.
The code below applies that logic, but the result only reflects the last column the loop encounters.
The code I'm using:
import pyspark.sql.functions as F
for i in df.columns:
    if i[4:]!='ff':
        new_df=df.withColumn(i,F.when(df.col_ff=="abc",df[i])\
            .otherwise(None))
df:
+------+----+-----+-------+
| col1 |col2|col3 | col_ff|
+------+----+-----+-------+
| a | a | d | abc |
| a | b | c | def |
| b | c | b | abc |
| c | d | a | def |
+------+----+-----+-------+
required output:
+------+----+-----+-------+
| col1 |col2|col3 | col_ff|
+------+----+-----+-------+
| a | a | d | abc |
| null |null|null | def |
| b | c | b | abc |
| null |null|null | def |
+------+----+-----+-------+

The problem in your code is that every iteration of the loop rebuilds new_df from the original DataFrame df, so the change made in the previous iteration is discarded and only the last column ends up modified. You can fix it by first setting new_df = df outside of the loop, and then performing the withColumn operations on new_df inside the loop.
For example, if df were the following:
df.show()
#+----+----+----+------+
#|col1|col2|col3|col_ff|
#+----+----+----+------+
#| a| a| d| abc|
#| a| b| c| def|
#| b| c| b| abc|
#| c| d| a| def|
#+----+----+----+------+
Change your code to:
import pyspark.sql.functions as F
new_df = df
for i in df.columns:
    if i[4:]!='ff':
        new_df = new_df.withColumn(i, F.when(F.col("col_ff")=="abc", F.col(i)))
Notice here that I removed the .otherwise(None) part because when will return null by default if the condition is not met.
You could also do the same using functools.reduce:
from functools import reduce # for python3
new_df = reduce(
    lambda df, i: df.withColumn(i, F.when(F.col("col_ff")=="abc", F.col(i))),
    [i for i in df.columns if i[4:] != "ff"],
    df
)
In both cases the result is the same:
new_df.show()
#+----+----+----+------+
#|col1|col2|col3|col_ff|
#+----+----+----+------+
#| a| a| d| abc|
#|null|null|null| def|
#| b| c| b| abc|
#|null|null|null| def|
#+----+----+----+------+
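
The difference between the two loops can be reproduced without Spark. The sketch below uses a plain dict as a stand-in for one row and a hypothetical with_column helper that, like withColumn, returns a new object instead of modifying its input; it is an illustration of the pattern, not of Spark itself.

```python
def with_column(row, name, value):
    """Return a new dict with one key replaced, leaving `row` untouched."""
    return {**row, name: value}


row = {"col1": "a", "col2": "b", "col3": "c", "col_ff": "def"}
targets = ["col1", "col2", "col3"]

# Buggy pattern: every iteration starts again from the original `row`,
# so only the last replacement survives.
for name in targets:
    buggy = with_column(row, name, None)

# Fixed pattern: every iteration builds on the previous result.
fixed = row
for name in targets:
    fixed = with_column(fixed, name, None)

print(buggy)
print(fixed)
```

In buggy only col3 is cleared, while fixed has all three columns cleared, which mirrors the behaviour described in the question and the corrected loop.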

How to sort on a variable within each group in pyspark?

I am trying to sort a value val using another column ts for each id.
# imports
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
import pandas as pd
# createDataFrame is an instance method, so obtain a session first
spark = SparkSession.builder.getOrCreate()
# create dummy data
pdf = pd.DataFrame( [['2',2,'cat'],['1',1,'dog'],['1',2,'cat'],['2',3,'cat'],['2',4,'dog']] ,columns=['id','ts','val'])
sdf = spark.createDataFrame( pdf )
sdf.show()
+---+---+---+
| id| ts|val|
+---+---+---+
| 2| 2|cat|
| 1| 1|dog|
| 1| 2|cat|
| 2| 3|cat|
| 2| 4|dog|
+---+---+---+

You can aggregate by id and sort by ts:
sorted_sdf = ( sdf.groupBy('id')
                  .agg( F.sort_array( F.collect_list( F.struct( F.col('ts'), F.col('val') ) ), asc = True)
                  .alias('sorted_col') )
             )
sorted_sdf.show()
+---+--------------------+
| id| sorted_col|
+---+--------------------+
| 1| [[1,dog], [2,cat]]|
| 2|[[2,cat], [3,cat]...|
+---+--------------------+
Then, we can explode this list:
explode_sdf = sorted_sdf.select( 'id' , F.explode( F.col('sorted_col') ).alias('sorted_explode') )
explode_sdf.show()
+---+--------------+
| id|sorted_explode|
+---+--------------+
| 1| [1,dog]|
| 1| [2,cat]|
| 2| [2,cat]|
| 2| [3,cat]|
| 2| [4,dog]|
+---+--------------+
Break the tuples of sorted_explode into two:
detupled_sdf = explode_sdf.select( 'id', 'sorted_explode.*' )
detupled_sdf.show()
+---+---+---+
| id| ts|val|
+---+---+---+
| 1| 1|dog|
| 1| 2|cat|
| 2| 2|cat|
| 2| 3|cat|
| 2| 4|dog|
+---+---+---+
Now our original dataframe is sorted by ts for each id!
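
The collect, sort, and explode steps above correspond to a simple group-and-sort in plain Python. The sketch below is only a model of that logic on the same sample data; it relies on Python tuples comparing element by element, which is why ts comes before val in each pair.

```python
from itertools import groupby
from operator import itemgetter

rows = [("2", 2, "cat"), ("1", 1, "dog"), ("1", 2, "cat"), ("2", 3, "cat"), ("2", 4, "dog")]

# Sorting by (id, ts) places each id's rows together, ordered by ts.
ordered = sorted(rows, key=itemgetter(0, 1))

# Model of groupBy('id') + sort_array(collect_list(struct(ts, val))):
# one sorted list of (ts, val) pairs per id.
grouped = {
    key: [(ts, val) for _, ts, val in group]
    for key, group in groupby(ordered, key=itemgetter(0))
}

print(grouped)
print(ordered)
```

Here grouped plays the role of sorted_col, and ordered matches the final detupled rows.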

Compare two dataframes and update the values

I have two dataframes like the following.
val file1 = spark.read.format("csv").option("sep", ",").option("inferSchema", "true").option("header", "true").load("file1.csv")
file1.show()
+---+-------+-----+-----+-------+
| id| name|mark1|mark2|version|
+---+-------+-----+-----+-------+
| 1| Priya | 80| 99| 0|
| 2| Teju | 10| 5| 0|
+---+-------+-----+-----+-------+
val file2 = spark.read.format("csv").option("sep", ",").option("inferSchema", "true").option("header", "true").load("file2.csv")
file2.show()
+---+-------+-----+-----+-------+
| id| name|mark1|mark2|version|
+---+-------+-----+-----+-------+
| 1| Priya | 80| 99| 0|
| 2| Teju | 70| 5| 0|
+---+-------+-----+-----+-------+
Now I am comparing the two dataframes column by column and showing the mismatched values like this.
val columns = file1.schema.fields.map(_.name)
val selectiveDifferences = columns.map(col => file1.select(col).except(file2.select(col)))
selectiveDifferences.map(diff => {if(diff.count > 0) diff.show})
+-----+
|mark1|
+-----+
| 10|
+-----+
I need to add an extra row to the first dataframe for each mismatch, taking the values from the second dataframe and updating the version number, like this.
file1.show()
+---+-------+-----+-----+-------+
| id| name|mark1|mark2|version|
+---+-------+-----+-----+-------+
| 1| Priya | 80| 99| 0|
| 2| Teju | 10| 5| 0|
| 3| Teju | 70| 5| 1|
+---+-------+-----+-----+-------+
The table above is my expected output, and I am struggling to produce it. Any help would be appreciated.

You can get your final dataframe by using except and union as follows:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val count = file1.count()
file1.union(file2.except(file1)
  .withColumn("version", lit(1)) // changing the version
  .withColumn("id", row_number().over(Window.orderBy("id")) + lit(count)) // changing the id number
)
The lit and row_number functions, together with a Window, are used to generate the new id and version.
Note: using a window function without a partition to generate the new id is inefficient, because all the rows being numbered are moved to a single partition.
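
The except-then-union idea can be traced in plain Python on the sample data. This is a sketch only: rows are modelled as tuples in the order (id, name, mark1, mark2, version), names are written without the padding shown in the tables, and sets stand in for the distinct-row semantics of except.

```python
file1 = [(1, "Priya", 80, 99, 0), (2, "Teju", 10, 5, 0)]
file2 = [(1, "Priya", 80, 99, 0), (2, "Teju", 70, 5, 0)]

# Model of file2.except(file1): rows of file2 that do not appear in file1,
# ordered by id (the first tuple element).
changed = sorted(set(file2) - set(file1))

# Model of row_number over an id-ordered window plus count:
# new ids continue after the existing rows, and version is set to 1.
count = len(file1)
new_rows = [
    (count + offset, name, mark1, mark2, 1)
    for offset, (_, name, mark1, mark2, _) in enumerate(changed, start=1)
]

# Model of file1.union(...).
result = file1 + new_rows
for row in result:
    print(row)
```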

Extracting array index in Spark Dataframe

I have a Dataframe with a Column of Array Type
For example :
val df = List(("a", Array(1d,2d,3d)), ("b", Array(4d,5d,6d))).toDF("ID", "DATA")
df: org.apache.spark.sql.DataFrame = [ID: string, DATA: array<double>]
scala> df.show
+---+---------------+
| ID| DATA|
+---+---------------+
| a|[1.0, 2.0, 3.0]|
| b|[4.0, 5.0, 6.0]|
+---+---------------+
I wish to explode the array and keep each element's index, like this:
+---+----------+----+
| ID|DATA_INDEX|DATA|
+---+----------+----+
|  a|         1| 1.0|
|  a|         2| 2.0|
|  a|         3| 3.0|
|  b|         1| 4.0|
|  b|         2| 5.0|
|  b|         3| 6.0|
+---+----------+----+
I would like to be able to do that with Scala, and with sparklyr or SparkR.
I'm using Spark 1.6.

There is a posexplode function available in Spark's functions. Note that the positions it returns start at 0.
import org.apache.spark.sql.functions._
df.select($"ID", posexplode($"DATA"))
PS: This is only available from Spark 2.1.0 onward.

With Spark 1.6, you can register your dataframe as a temporary table and then run HiveQL over it to get the desired result.
df.registerTempTable("tab")
sqlContext.sql("""
select
ID, exploded.DATA_INDEX + 1 as DATA_INDEX, exploded.DATA
from
tab
lateral view posexplode(tab.DATA) exploded as DATA_INDEX, DATA
""").show
+---+----------+----+
| ID|DATA_INDEX|DATA|
+---+----------+----+
| a| 1| 1.0|
| a| 2| 2.0|
| a| 3| 3.0|
| b| 1| 4.0|
| b| 2| 5.0|
| b| 3| 6.0|
+---+----------+----+
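
What posexplode produces can be modelled in plain Python with enumerate. The sketch below is an illustration on the same sample data, not Spark code; start=1 plays the role of the "+ 1" in the query above, since posexplode positions are zero-based.

```python
rows = [("a", [1.0, 2.0, 3.0]), ("b", [4.0, 5.0, 6.0])]

# One output row per array element, carrying the ID and a one-based index.
exploded = [
    (row_id, index, value)
    for row_id, data in rows
    for index, value in enumerate(data, start=1)
]

for row in exploded:
    print(row)
```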
