PySpark: Mode formula with IFERROR on multiple columns

I have a pyspark dataframe with 6 columns as below:
MLP Accuracy_MLP XGB Accuracy_XGB RF Accuracy_RF
3411 0.99199374 3411 0.935491304 3411 0.42968293
9518 0.99999988 9623 0.884243041 4567 0.686784383
9518 0.999999882 9518 0.957964659 2567 0.801463674
I want to add an extra column 'RESULT' based on the following rule:
Take the mode of the MLP, XGB, RF columns
If that errors out (no mode), take only the RF column
In Excel the formula would be IFERROR(MODE.MULT(MLP, XGB, RF), RF)
Result:
MLP Accuracy_MLP XGB Accuracy_XGB RF Accuracy_RF RESULT
3411 0.99199374 3411 0.935491304 3411 0.42968293 3411
9518 0.99999988 9623 0.884243041 4567 0.686784383 4567
9518 0.999999882 9518 0.957964659 2567 0.801463674 9518

There's no IFERROR in PySpark, but you can handle several conditions with a when / otherwise statement:
https://sparkbyexamples.com/pyspark/pyspark-when-otherwise/
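A minimal sketch of that approach, assuming the dataframe is called df and has the MLP, XGB and RF columns shown above. With three columns the mode is whichever value appears at least twice; if all three disagree (the case where Excel's MODE would error), fall back to RF:
from pyspark.sql import functions as F
# Mode of three columns via when/otherwise, falling back to RF when there is no mode
df = df.withColumn(
    "RESULT",
    F.when(F.col("MLP") == F.col("XGB"), F.col("MLP"))   # MLP agrees with XGB (also covers all three equal)
     .when(F.col("MLP") == F.col("RF"), F.col("MLP"))    # MLP agrees with RF
     .when(F.col("XGB") == F.col("RF"), F.col("XGB"))    # XGB agrees with RF
     .otherwise(F.col("RF"))                             # no majority value -> take RF
)
df.show()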

Related

Converting (casting) columns into rows in Pyspark

I have a Spark dataframe in the format below, where each unique id can have a maximum of 3 rows, indicated by the rank column.
id pred prob rank
485 9716 0.19205872 1
729 9767 0.19610429 1
729 9716 0.186840048 2
729 9748 0.173447074 3
818 9731 0.255104463 1
818 9748 0.215499913 2
818 9716 0.207307154 3
I want to convert (cast) this into row-wise data such that each id has just one row, and the pred & prob columns become multiple columns differentiated by the rank variable (as a column postfix).
id pred_1 prob_1 pred_2 prob_2 pred_3 prob_3
485 9716 0.19205872
729 9767 0.19610429 9716 0.186840048 9748 0.173447074
818 9731 0.255104463 9748 0.215499913 9716 0.207307154
I am not able to figure out how to do it in PySpark.
Sample code for input data creation:
# Loading the requisite packages
from pyspark.sql.functions import col, explode, array, struct, expr, sum, lit
# Creating the DataFrame
df = sqlContext.createDataFrame([(485,9716,19,1),(729,9767,19,1),(729,9716,18,2), (729,9748,17,3), (818,9731,25,1), (818,9748,21,2), (818,9716,20,3)],('id','pred','prob','rank'))
df.show()
This is a pivot on multiple columns problem. Try:
import pyspark.sql.functions as F
df_pivot = df.groupBy('id').pivot('rank').agg(F.first('pred').alias('pred'), F.first('prob').alias('prob')).orderBy('id')
df_pivot.show(truncate=False)
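If the column names from the expected output (pred_1, prob_1, ...) are needed, the pivoted columns can be renamed afterwards. A small sketch, assuming Spark's default <rank>_<alias> naming for pivots with multiple aggregations:
# Rename "<rank>_<field>" pivot columns to "<field>_<rank>" to match the expected layout
for c in df_pivot.columns:
    if c != 'id':
        rank, field = c.split('_', 1)
        df_pivot = df_pivot.withColumnRenamed(c, field + '_' + rank)
df_pivot.show(truncate=False)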

Pyspark - How to concatenate columns of multiple dataframes into columns of one dataframe

I have multiple data frames (24 in total), each with one column. I need to combine all of them into a single data frame. I created indexes and joined using the indexes, but it is quite slow to join all of them (all have the same number of rows).
Please note that I'm using PySpark 2.1.
from pyspark.sql import Window
from pyspark.sql.functions import row_number, lit

w = Window().orderBy(lit('A'))
df1 = df1.withColumn('Index', row_number().over(w))
df2 = df2.withColumn('Index', row_number().over(w))
joined_df = df1.join(df2, df1.Index == df2.Index, 'inner').drop(df2.Index)
df3 = df3.withColumn('Index', row_number().over(w))
joined_df = joined_df.join(df3, joined_df.Index == df3.Index).drop(df3.Index)
But as the joined_df grows, it keeps getting slower
DF1:
Col1
2
8
18
12
DF2:
Col2
abc
bcd
def
bbc
DF3:
Col3
1.0
2.2
12.1
1.9
Expected Results:
joined_df:
Col1 Col2 Col3
2 abc 1.0
8 bcd 2.2
18 def 12.1
12 bbc 1.9
You're doing it the correct way. Unfortunately, without a primary key, Spark is not well suited for this type of operation.
Answer by pault, pulled from a comment.
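If you do stay with the index-join approach, the repeated pairwise joins can at least be written once with functools.reduce. A sketch under the same caveat as the code above (a Window ordered by a constant literal pulls all rows into a single partition, which is why this stays slow at scale); df1, df2, df3 stand in for your 24 frames:
from functools import reduce
from pyspark.sql import Window
from pyspark.sql.functions import row_number, lit

# Attach the same row_number index to every frame, then fold them together with joins
w = Window.orderBy(lit('A'))
dfs = [df1, df2, df3]  # extend this list with the remaining frames
indexed = [d.withColumn('Index', row_number().over(w)) for d in dfs]
joined_df = reduce(lambda a, b: a.join(b, on='Index', how='inner'), indexed)
joined_df = joined_df.drop('Index')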

spark Group By data-frame columns without aggregation [duplicate]

This question already has answers here:
How to aggregate values into collection after groupBy?
(3 answers)
Closed 4 years ago.
I have a csv file in HDFS: /hdfs/test.csv. I would like to group the data below using Spark & Scala. I want to group the A1...AN columns based on the A1 column, so that all the rows are grouped as shown below.
Output:
JACK , ABCD, ARRAY("0,1,0,1", "2,9,2,9")
JACK , LMN, ARRAY("0,1,0,3", "0,4,3,T")
JACK, HBC, ARRAY("1,T,5,21", "E7,4W,5,8")
Input:
++++++++++++++++++++++++++++++
name A1 A1 A2 A3..AN
--------------------------------
JACK ABCD 0 1 0 1
JACK LMN 0 1 0 3
JACK ABCD 2 9 2 9
JACK HBC 1 T 5 21
JACK LMN 0 4 3 T
JACK HBC E7 4W 5 8
You can achieve this by putting the A columns into an array.
import org.apache.spark.sql.functions.{collect_set, concat_ws, array, col}

val aCols = 1.to(250).map(x => col(s"A$x"))
val concatCol = concat_ws(",", array(aCols: _*))
val groupedDf = df.withColumn("aConcat", concatCol).
  groupBy("name", "A").
  agg(collect_set("aConcat"))
If you're okay with duplicates you can also use collect_list instead of collect_set.
Your input has two different columns called A1. I will assume the groupBy category is called A, while the element to put in that final array is A1.
If you load the data into a DataFrame, you can do this to achieve the output specified:
import org.apache.spark.sql.functions.{collect_set, concat_ws}
import spark.implicits._ // for the $"col" syntax, assuming a SparkSession named spark

val grouped = someDF
  .groupBy($"name", $"A")
  .agg(collect_set(concat_ws(",", $"A1", $"A2", $"A3", $"A4")).alias("grouped"))

What's the Scala Spark equivalent of R's subset

Assume I have a data frame df in Spark with a structure like so.
Input:
amount city
10000 la
12145 ng
14000 wy
18000 la
How can I subset the data frame for amount > 10000?
Expected Output:
amount city
12145 ng
14000 wy
18000 la
In R I can do something like this:
df1 <- df[df$amount > 10000 ,]
I know I can use Spark SQL to do the same, but what is the DataFrame operation equivalent to the above?
From the docs:
http://spark.apache.org/docs/latest/sql-programming-guide.html#untyped-dataset-operations-aka-dataframe-operations
val df1 = df.filter($"amount" > 10000)

Collapse a Spark DataFrame

I am using Spark 1.5 with Scala. I am trying to transform an input dataframe that has name/value combinations into a new dataframe in which each name is transposed into a column and the corresponding values fill its rows.
I/P DataFrame:
ID Name Value
1 Country US
2 Country US
2 State NY
3 Country UK
4 Country India
4 State MH
5 Country US
5 State NJ
5 County Hudson
Transposed DataFrame
ID Country State County
1 US NULL NULL
2 US NY NULL
3 UK NULL NULL
4 India MH NULL
5 US NJ Hudson
It seems like pivot would help in this use case, but it's not supported in Spark 1.5.x.
Any pointers/help?
This is really ugly data, but you can always filter and join:
val names = Seq("Country", "State", "County")
names.map(name =>
  df.where($"Name" === name).select($"ID", $"Value".alias(name))
).reduce((df1, df2) => df1.join(df2, Seq("ID"), "leftouter"))
map creates a list of three DataFrames, where each one contains records for only a single name. Next we simply reduce this list using a left outer join. So putting it all together you get something like this:
(left-outer-join
  (left-outer-join
    (where df (=== name "Country"))
    (where df (=== name "State")))
  (where df (=== name "County")))
Note: If you use Spark >= 1.6 with Python or Scala, or Spark >= 2.0 with R, just use pivot with first:
Reshaping/Pivoting data in Spark RDD and/or Spark DataFrames
How to pivot DataFrame?
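For reference, a minimal sketch of the pivot-with-first approach in PySpark (assuming Spark >= 1.6 and an input dataframe df with the ID, Name, Value columns shown above):
from pyspark.sql import functions as F

# Pivot the Name column into Country/State/County columns, taking the first Value per ID
transposed = (
    df.groupBy("ID")
      .pivot("Name", ["Country", "State", "County"])  # listing the values avoids an extra pass over the data
      .agg(F.first("Value"))
      .orderBy("ID")
)
transposed.show()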