I'm trying to accomplish the following in PySpark. A sample source is provided below; the actual source will contain many more records.
Source:
Expected output:
You could use the stack function:
Setup of your example data:
from pyspark.sql import Row, SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
Row(COLA='H', COLB='I', COLC='J',
COL_GRP_A_1=0.1, COL_GRP_A_2=1., COL_GRP_A_3=3.,
COL_GRP_B_1=4., COL_GRP_B_2=2.5, COL_GRP_B_3=6.,
COL_GRP_C_1=2., COL_GRP_C_2=5., COL_GRP_C_3=4.,
),
])
df.show()
# Output
+----+----+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|COLA|COLB|COLC|COL_GRP_A_1|COL_GRP_A_2|COL_GRP_A_3|COL_GRP_B_1|COL_GRP_B_2|COL_GRP_B_3|COL_GRP_C_1|COL_GRP_C_2|COL_GRP_C_3|
+----+----+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| H| I| J| 0.1| 1.0| 3.0| 4.0| 2.5| 6.0| 2.0| 5.0| 4.0|
+----+----+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
Now the processing:
(
    df
    .selectExpr(
        'COLA', 'COLB', 'COLC',
        'stack(3, '
        '"COL_GRP_A", COL_GRP_A_1, COL_GRP_A_2, COL_GRP_A_3, '
        '"COL_GRP_B", COL_GRP_B_1, COL_GRP_B_2, COL_GRP_B_3, '
        '"COL_GRP_C", COL_GRP_C_1, COL_GRP_C_2, COL_GRP_C_3'
        ') AS (GRP, COL_VAL1, COL_VAL2, COL_VAL3)'
    )
    .show()
)
# Output:
+----+----+----+---------+--------+--------+--------+
|COLA|COLB|COLC| GRP|COL_VAL1|COL_VAL2|COL_VAL3|
+----+----+----+---------+--------+--------+--------+
| H| I| J|COL_GRP_A| 0.1| 1.0| 3.0|
| H| I| J|COL_GRP_B| 4.0| 2.5| 6.0|
| H| I| J|COL_GRP_C| 2.0| 5.0| 4.0|
+----+----+----+---------+--------+--------+--------+
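If more groups are added later, you could also build the stack expression dynamically from the column names. A minimal sketch (not part of the original answer), assuming the COL_GRP_<group>_<n> naming convention from the example and three values per group:
import re

# Minimal sketch: derive the group prefixes from the column names
# (assumes the COL_GRP_<group>_<n> convention with 3 values per group)
grp_cols = [c for c in df.columns if c.startswith('COL_GRP_')]
groups = sorted({re.match(r'(COL_GRP_[A-Z]+)_\d+$', c).group(1) for c in grp_cols})

stack_args = ', '.join(f'"{g}", {g}_1, {g}_2, {g}_3' for g in groups)
expr = f'stack({len(groups)}, {stack_args}) AS (GRP, COL_VAL1, COL_VAL2, COL_VAL3)'

df.selectExpr('COLA', 'COLB', 'COLC', expr).show()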
I have a data frame like the one below:
data = [
(1, None,7,10,11,19),
(1, 4,None,10,43,58),
(None, 4,7,67,88,91),
(1, None,7,78,96,32)
]
df = spark.createDataFrame(data, ["A_min", "B_min","C_min","A_max", "B_max","C_max"])
df.show()
I would like the 'min' columns to have their null values replaced by the values from their equivalent 'max' columns.
For example, null values in the A_min column should be replaced by the corresponding A_max values.
The result should look like the data frame below.
+-----+-----+-----+-----+-----+-----+
|A_min|B_min|C_min|A_max|B_max|C_max|
+-----+-----+-----+-----+-----+-----+
| 1| 11| 7| 10| 11| 19|
| 1| 4| 58| 10| 43| 58|
| 67| 4| 7| 67| 88| 91|
| 1| 96| 7| 78| 96| 32|
+-----+-----+-----+-----+-----+-----+
I have tried the code below by defining the columns, but clearly this does not work. I would really appreciate any help.
min_cols = ["A_min", "B_min","C_min"]
max_cols = ["A_max", "B_max","C_max"]
for i in min_cols
df = df.withColumn(i,when(f.col(i)=='',max_cols.otherwise(col(i))))
display(df)
Assuming you have the same number of max and min columns, you can use coalesce along with a Python list comprehension to obtain your solution:
from pyspark.sql.functions import coalesce
min_cols = ["A_min", "B_min","C_min"]
max_cols = ["A_max", "B_max","C_max"]
df.select(
    *[coalesce(df[val], df[max_cols[pos]]).alias(val) for pos, val in enumerate(min_cols)],
    *max_cols
).show()
Output:
+-----+-----+-----+-----+-----+-----+
|A_min|B_min|C_min|A_max|B_max|C_max|
+-----+-----+-----+-----+-----+-----+
| 1| 11| 7| 10| 11| 19|
| 1| 4| 58| 10| 43| 58|
| 67| 4| 7| 67| 88| 91|
| 1| 96| 7| 78| 96| 32|
+-----+-----+-----+-----+-----+-----+
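For comparison, here is a sketch of the when/otherwise approach the question attempted, fixed to test for nulls (rather than empty strings) and to reference the matching max column; it produces the same result as the coalesce version:
from pyspark.sql import functions as f

min_cols = ["A_min", "B_min", "C_min"]
max_cols = ["A_max", "B_max", "C_max"]

df2 = df
for min_c, max_c in zip(min_cols, max_cols):
    # Replace nulls in each min column with the value from its matching max column
    df2 = df2.withColumn(
        min_c,
        f.when(f.col(min_c).isNull(), f.col(max_c)).otherwise(f.col(min_c))
    )
df2.show()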
I have a matrix of size 1000×10000. I want to convert this matrix into a PySpark dataframe.
Can someone please tell me how to do it? This post has an example, but my number of columns is large, so assigning column names manually would be difficult.
Thanks!
In order to create a PySpark DataFrame, you can use the function createDataFrame():
matrix=([11,12,13,14,15],[21,22,23,24,25],[31,32,33,34,35],[41,42,43,44,45])
df=spark.createDataFrame(matrix)
df.show()
+---+---+---+---+---+
| _1| _2| _3| _4| _5|
+---+---+---+---+---+
| 11| 12| 13| 14| 15|
| 21| 22| 23| 24| 25|
| 31| 32| 33| 34| 35|
| 41| 42| 43| 44| 45|
+---+---+---+---+---+
As you can see above, the columns will be named automatically with numbers.
You can also pass your own column names to the createDataFrame() function:
columns=[ 'mycol_'+str(col) for col in range(5) ]
df=spark.createDataFrame(matrix,schema=columns)
df.show()
+-------+-------+-------+-------+-------+
|mycol_0|mycol_1|mycol_2|mycol_3|mycol_4|
+-------+-------+-------+-------+-------+
| 11| 12| 13| 14| 15|
| 21| 22| 23| 24| 25|
| 31| 32| 33| 34| 35|
| 41| 42| 43| 44| 45|
+-------+-------+-------+-------+-------+
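If your 1000×10000 matrix lives in a NumPy array (an assumption; the question does not say what holds the matrix), you can generate the column names programmatically and convert the rows to plain Python lists, for example:
import numpy as np

# Assumption: the matrix is a NumPy array; adjust to your actual data source
matrix_large = np.random.rand(1000, 10000)
columns = ['mycol_' + str(i) for i in range(matrix_large.shape[1])]

# tolist() turns NumPy scalars into plain Python floats, which createDataFrame accepts
df_large = spark.createDataFrame(matrix_large.tolist(), schema=columns)
Note that Spark is not optimised for tens of thousands of columns, so expect the conversion and subsequent operations on such a wide dataframe to be slow.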
from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate()
prsn = (
    sc.read.format("csv")
    .option("delimiter", ",")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("C:/Users/amit.suar/IdeaProjects/LearningPyspark/prsn.csv")
)
prsn.show()
+-------------------------+------------------------+---+-----------+-----------------------------+
|PERSON_MEDIA_CONSUMER_KEY|PERSON_MEDIA_CONSUMER_ID|AGE|GENDER_CODE|EDUCATION_LEVEL_CATEGORY_CODE|
+-------------------------+------------------------+---+-----------+-----------------------------+
| 101| 3285854| 15| 1| 1|
| 102| 2313090| 25| 1| 3|
| 103| 2295854| 33| 2| 6|
| 104| 2295854| 33| 2| 6|
| 105| 2471554| 26| 2| 4|
| 106| 2471554| 26| 2| 4|
+-------------------------+------------------------+---+-----------+-----------------------------+
I want to capture this output as a string in a variable. How can I achieve it?
There is an internal/private function that returns the same string that .show() prints:
# Returns the first n rows of the dataframe as a formatted table string
# (n defaults to 20 in .show(); the second argument is the truncate length)
dataframe._jdf.showString(n, 20)
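For example, to capture what prsn.show() would print (a sketch; note that the exact signature of showString depends on your Spark version):
# Recent Spark versions: showString(numRows, truncate, vertical)
# Older 2.x versions: showString(numRows, truncate)
output_str = prsn._jdf.showString(20, 20, False)
print(output_str)
Since this relies on a private attribute (_jdf), it may change between releases without notice.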
Hello, I have a dataframe that needs to be updated based on another dataframe. There are fields that we are going to sum, and others where we just take the new value provided by the second dataframe. Here is what I did:
import org.apache.spark.sql.types._
import spark.implicits._   // for the 'colName symbol syntax used in the casts

val hist1 = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("C:/Users/MHT/Desktop/histocaisse_dte1.csv")
.withColumn("article_id", 'article_id.cast(LongType))
.withColumn("pos_id", 'pos_id.cast(LongType))
.withColumn("qte", 'qte.cast(LongType))
.withColumn("ca", 'ca.cast(DoubleType))
hist1.show
val hist2 = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("C:/Users/MHT/Desktop/his2.csv")
.withColumn("article_id", 'article_id.cast(LongType))
.withColumn("date", 'date.cast(DateType))
.withColumn("qte", 'qte.cast(LongType))
.withColumn("ca", 'ca.cast(DoubleType))
hist2.show
val df3 = hist1.unionAll(hist2)
//
val df4 = df3.groupBy("pos_id", "article_id").agg($"pos_id", $"article_id", max("date"), sum("qte"), sum("ca"))
df4.show
+------+----------+----------+---+----+----------+
|pos_id|article_id| date|qte| ca|sale_price|
+------+----------+----------+---+----+----------+
| 1| 1|2000-01-07| 3| 3.5| 14.3|
| 2| 2|2000-01-07| 15|12.0| 13.2|
| 3| 2|2000-01-07| 4| 1.2| 14.3|
| 4| 2|2000-01-07| 4| 1.2| 12.3|
+------+----------+----------+---+----+----------+
+------+----------+----------+---+----+----------+
|pos_id|article_id| date|qte| ca|sale_price|
+------+----------+----------+---+----+----------+
| 1| 1|2000-01-08| 3| 3.5| 14.5|
| 2| 2|2000-01-08| 15|12.0| 20.2|
| 3| 2|2000-01-08| 4| 1.2| 17.5|
| 4| 2|2000-01-08| 4| 1.2| 18.2|
| 5| 3|2000-01-08| 15| 1.2| 11.2|
| 6| 1|2000-01-08| 2|1.25| 13.5|
| 6| 2|2000-01-08| 2|1.25| 14.3|
+------+----------+----------+---+----+----------+
+------+----------+----------+--------+-------+
|pos_id|article_id| max(date)|sum(qte)|sum(ca)|
+------+----------+----------+--------+-------+
| 2| 2|2000-01-08| 30| 24.0|
| 3| 2|2000-01-08| 8| 2.4|
| 1| 1|2000-01-08| 6| 7.0|
| 5| 3|2000-01-08| 15| 1.2|
| 6| 1|2000-01-08| 2| 1.25|
| 6| 2|2000-01-08| 2| 1.25|
| 4| 2|2000-01-08| 8| 2.4|
+------+----------+----------+--------+-------+
How should the query be written if I also want to keep the field sale_price, taking the new sale_price provided by the second dataframe? That is, how should this line change:
val df4 = df3.groupBy("pos_id", "article_id").agg($"pos_id", $"article_id", max("date"), sum("qte"), sum("ca"))
Many thanks in advance
You can use a join on the last line, as follows:
val df4 = df3
  .groupBy("pos_id", "article_id")
  .agg(max("date"), sum("qte"), sum("ca"))
  .join(hist2.select("pos_id", "article_id", "sale_price"), Seq("pos_id", "article_id"))
This should give you your desired output.
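For readers working in PySpark rather than Scala, a rough equivalent sketch (dataframe and column names taken from the example above; it assumes hist1 and hist2 have been loaded the same way in Python):
from pyspark.sql import functions as F

# Rough PySpark equivalent of the Scala answer above (a sketch, not the original code)
df3 = hist1.union(hist2)
df4 = (
    df3.groupBy("pos_id", "article_id")
       .agg(F.max("date").alias("date"), F.sum("qte").alias("qte"), F.sum("ca").alias("ca"))
       .join(hist2.select("pos_id", "article_id", "sale_price"), ["pos_id", "article_id"])
)
df4.show()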
I have data in an RDD which has 4 columns: geog, product, time and price. I want to calculate the running sum based on geog and time.
Given Data
I need a result like this:
I need this in Spark Scala RDD. I am new to the Scala world; I can achieve this easily in SQL, but I want to do it in Spark Scala RDD, e.g. using map/flatMap.
Thanks in advance for your help.
This is possible by defining a window function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._   // needed for .toDF outside the spark-shell

val data = List(
  ("India", "A1", "Q1", 40),
  ("India", "A2", "Q1", 30),
  ("India", "A3", "Q1", 21),
  ("German", "A1", "Q1", 50),
  ("German", "A3", "Q1", 60),
  ("US", "A1", "Q1", 60),
  ("US", "A2", "Q2", 25),
  ("US", "A4", "Q1", 20),
  ("US", "A5", "Q5", 15),
  ("US", "A3", "Q3", 10)
)

val df = sc.parallelize(data).toDF("country", "part", "quarter", "result")
df.show()
+-------+----+-------+------+
|country|part|quarter|result|
+-------+----+-------+------+
| India| A1| Q1| 40|
| India| A2| Q1| 30|
| India| A3| Q1| 21|
| German| A1| Q1| 50|
| German| A3| Q1| 60|
| US| A1| Q1| 60|
| US| A2| Q2| 25|
| US| A4| Q1| 20|
| US| A5| Q5| 15|
| US| A3| Q3| 10|
+-------+----+-------+------+
val window = Window.partitionBy("country").orderBy("part", "quarter")
val resultDF = df.withColumn("agg", sum(df("result")).over(window))
resultDF.show()
+-------+----+-------+------+---+
|country|part|quarter|result|agg|
+-------+----+-------+------+---+
| India| A1| Q1| 40| 40|
| India| A2| Q1| 30| 70|
| India| A3| Q1| 21| 91|
| US| A1| Q1| 60| 60|
| US| A2| Q2| 25| 85|
| US| A3| Q3| 10| 95|
| US| A4| Q1| 20|115|
| US| A5| Q5| 15|130|
| German| A1| Q1| 50| 50|
| German| A3| Q1| 60|110|
+-------+----+-------+------+---+
You can do this using Window functions; please take a look at the Databricks blog post introducing window functions in Spark SQL:
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
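For PySpark users, the same running sum can be written like this (a minimal sketch, assuming the country/part/quarter/result columns from the example above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Running sum of `result` per country, ordered by part and quarter
w = Window.partitionBy("country").orderBy("part", "quarter")
result_df = df.withColumn("agg", F.sum("result").over(w))
result_df.show()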
Hope this helps.
Happy Sparking! Cheers, Fokko
I think this will help others too. I tried it with a Scala RDD:
val fileName_test_1 = "C:\\venkat_workshop\\Qintel\\Data_Files\\test_1.txt"

val rdd1 = sc.textFile(fileName_test_1)
  .map { line =>
    val cols = line.split(",")
    (cols(0), cols(1), cols(2), cols(3).toDouble)
  }
  .groupBy(x => (x._1, x._3))                   // group by the 1st and 3rd fields (geog, time)
  .mapValues {
    _.toList
      .sortWith((a, b) => a._4 > b._4)          // sort by price, descending
      .scanLeft(("", "", "", 0.0, 0.0)) {       // the 5th element carries the running sum
        (acc, cur) => (cur._1, cur._2, cur._3, cur._4, cur._4 + acc._5)
      }
      .tail                                     // drop the initial accumulator
  }
  .flatMapValues(f => f)
  .values