I'm using PySpark 3.0.1. I have a dataframe df with the following details:
ID Class dateEnrolled dateStarted
32 1 2016-01-09 2016-01-26
25 1 2016-01-09 2016-01-10
33 1 2016-01-16 2016-01-05
I need to replace dateEnrolled with the later of the two date fields, so my data should look like this:
ID Class dateEnrolled dateStarted
32 1 2016-01-26 2016-01-26
25 1 2016-01-10 2016-01-10
33 1 2016-01-16 2016-01-05
Can you suggest how to do that?
You can use greatest:
import pyspark.sql.functions as F
df2 = df.withColumn('dateEnrolled', F.greatest('dateEnrolled', 'dateStarted'))
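A self-contained sketch reproducing the sample data above (assuming the dates arrive as strings and should be cast to DateType, so that greatest compares them as dates rather than text):
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [(32, 1, '2016-01-09', '2016-01-26'),
     (25, 1, '2016-01-09', '2016-01-10'),
     (33, 1, '2016-01-16', '2016-01-05')],
    ['ID', 'Class', 'dateEnrolled', 'dateStarted'],
)
# cast the string columns to dates before taking the row-wise maximum
df = df.withColumn('dateEnrolled', F.to_date('dateEnrolled')) \
       .withColumn('dateStarted', F.to_date('dateStarted'))
df2 = df.withColumn('dateEnrolled', F.greatest('dateEnrolled', 'dateStarted'))
df2.show()  # dateEnrolled becomes 2016-01-26, 2016-01-10, 2016-01-16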
I have multiple data frames (24 in total), each with one column. I need to combine all of them into a single data frame. I created indexes and joined using the indexes, but joining all of them is quite slow (they all have the same number of rows).
Please note that I'm using PySpark 2.1.
from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number

w = Window().orderBy(lit('A'))
df1 = df1.withColumn('Index', row_number().over(w))
df2 = df2.withColumn('Index', row_number().over(w))
joined_df = df1.join(df2, df1.Index == df2.Index, 'inner').drop(df2.Index)
df3 = df3.withColumn('Index', row_number().over(w))
joined_df = joined_df.join(df3, joined_df.Index == df3.Index).drop(df3.Index)
But as the joined_df grows, it keeps getting slower
DF1:
Col1
2
8
18
12
DF2:
Col2
abc
bcd
def
bbc
DF3:
Col3
1.0
2.2
12.1
1.9
Expected Results:
joined_df:
Col1 Col2 Col3
2 abc 1.0
8 bcd 2.2
18 def 12.1
12 bbc 1.9
You're doing it the correct way. Unfortunately, without a primary key, Spark is not well suited for this type of operation.
Answer by pault, pulled from comment.
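A possible workaround sketch, assuming every frame really has the same row count and a deterministic row order: build the index with the RDD API's zipWithIndex instead of a window ordered by a constant, which avoids pulling every row through a single partition.
from functools import reduce

def with_index(df):
    # zipWithIndex assigns a stable 0-based position to each row without a global window
    return (df.rdd.zipWithIndex()
              .map(lambda pair: pair[0] + (pair[1],))
              .toDF(df.columns + ['Index']))

dfs = [with_index(d) for d in [df1, df2, df3]]  # extend the list to all 24 frames
joined_df = reduce(lambda left, right: left.join(right, 'Index', 'inner'), dfs).drop('Index')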
I am trying to create a data frame from a data feed which has the following format:
ABC,13:10,23| PQR,01:20,2| XYZ,07:30,14
BCD,11:40,13| ABC,05:50,9| RST,17:20,5
The records are pipe-delimited, each line carries a batch of 3, and each record consists of 3 comma-separated sub-records.
I intend to have each sub-record as a column and each record as one row of the data frame. So the above would result in 3 columns and 9 rows.
col1 col2 col3
ABC 13:10 23
PQR 01:20 2
from pyspark.sql.functions import split, explode
df = spark.read.text("/path/to/data.csv")
# split each line on the pipe so every comma-separated record becomes its own row
df.select(explode(split(df["value"], r"\|"))).show()
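If you then need the three fields of each record as separate columns, a follow-up sketch (the names col1..col3 are just assumptions taken from the expected output above):
from pyspark.sql.functions import split, explode, trim, col

records = spark.read.text("/path/to/data.csv") \
    .select(explode(split(col("value"), r"\|")).alias("record"))
# each record looks like 'NAME,HH:MM,NUM'; trim the leading space and split on the comma
result = records.select(split(trim(col("record")), ",").alias("p")) \
    .select(col("p")[0].alias("col1"), col("p")[1].alias("col2"), col("p")[2].alias("col3"))
result.show()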
I have a CSV file in HDFS: /hdfs/test.csv. I'd like to group the data below using Spark & Scala, and I need output something like this:
I want to group the A1...AN columns based on the A1 column, so that all the rows are grouped as below.
Output:
JACK , ABCD, ARRAY("0,1,0,1", "2,9,2,9")
JACK , LMN, ARRAY("0,1,0,3", "0,4,3,T")
JACK, HBC, ARRAY("1,T,5,21", "E7,4W,5,8")
Input:
++++++++++++++++++++++++++++++
name A1 A1 A2 A3..AN
--------------------------------
JACK ABCD 0 1 0 1
JACK LMN 0 1 0 3
JACK ABCD 2 9 2 9
JACK HBC 1 T 5 21
JACK LMN 0 4 3 T
JACK HBC E7 4W 5 8
You can achieve this by putting the A columns into an array:
import org.apache.spark.sql.functions.{collect_set, concat_ws, array, col}
val aCols = 1.to(250).map(x => col(s"A$x"))
val concatCol = concat_ws(",", array(aCols: _*))
val groupedDf = df.withColumn("aConcat", concatCol).
  groupBy("name", "A").
  agg(collect_set("aConcat"))
If you're okay with duplicates you can also use collect_list instead of collect_set.
Your input has two different columns called A1. I will assume the groupBy category is called A, while the element to put in that final array is A1.
If you load the data into a DataFrame, you can do this to achieve the output specified:
import org.apache.spark.sql.functions.{collect_set, concat_ws}
import spark.implicits._ // for the $"colName" syntax
val grouped = someDF
  .groupBy($"name", $"A")
  .agg(collect_set(concat_ws(",", $"A1", $"A2", $"A3", $"A4")).alias("grouped"))
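For reference, a rough PySpark equivalent of that second approach (a sketch; the names name, A, and A1..A4 come from the example above):
from pyspark.sql import functions as F

grouped = (df.groupBy('name', 'A')
             .agg(F.collect_set(F.concat_ws(',', 'A1', 'A2', 'A3', 'A4')).alias('grouped')))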
input DF:
A B
1 1
2 1
2 2
3 3
3 1
3 2
3 3
3 4
I am trying to filter the rows based on the combination of
(A, Max(B))
Output Df:
A B
1 1
2 2
3 4
I am able to do this with
df.groupBy()
but there are also other columns in the DF which I want to be selected yet do not want included in the groupBy.
So the filtering condition should apply only to these columns (A and B) and not to the other columns in the DF. Any suggestions, please?
As suggested in How to get other columns when using Spark DataFrame groupby? you can use window functions
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
df.withColumn("maxB", max(col("B")).over(Window.partitionBy("A"))).where(...)
where ... is replaced by a predicate comparing B with maxB, e.g. keeping the rows where B equals maxB, and then dropping the helper column.
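For reference, a rough PySpark equivalent of the same window approach (a sketch assuming the columns are named A and B as above):
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy('A')
result = (df.withColumn('maxB', F.max('B').over(w))
            .where(F.col('B') == F.col('maxB'))   # keep only the rows holding the group maximum
            .drop('maxB'))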