I am trying to pivot a Spark DataFrame on multiple columns using the pivot function, but as soon as I pass more than one column it fails with an overloaded method error.
This is the error I get after adding the third column:
overloaded method value pivot with alternatives: (pivotColumn: String,values:
java.util.List[Any])org.apache.spark.sql.RelationalGroupedDataset
(pivotColumn: String,values:
Seq[Any])org.apache.spark.sql.RelationalGroupedDataset
(pivotColumn: String)org.apache.spark.sql.RelationalGroupedDataset
cannot be applied to (String, String, String)
Here's my work:
val df_new = df.join(df1, df("Col1") <=> df1("col1") && df1("col2") <=> df("col2"))
  .groupBy(df("Col6"))
  .agg(
    sum(df("Col1")).alias("Col1"),
    sum(df("Col2")).alias("Col2"),
    sum(df("Col3")).alias("Col3"),
    sum(df("Col4")).alias("Col4"),
    sum(df("Col5")).alias("Col5")
  )
  .select('Amount, 'Col1, 'Col2, 'Col3, 'Col4, 'Col5)
// Pivot
val pivotdf = df_new.groupBy($"Col1")
  .pivot("Col1", "Col2", "Col3", "col4")
I need to pivot on Col1, Col2, Col3, Col4, and Col5. Please guide me on how I can do that.
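For reference, the error is saying that pivot accepts only a single pivot column (optionally with a list of values). One common workaround is to concatenate the columns you want to pivot on into a single key column and pivot on that. A minimal sketch, with the column roles assumed from the question (adjust to your actual layout):

import org.apache.spark.sql.functions.{col, concat_ws, sum}

// pivot() takes one pivot column, so build a combined key from the columns
// to pivot on (Col2-Col4 here is an assumption), then pivot on that key
val pivotdf = df_new
  .withColumn("pivot_key", concat_ws("_", col("Col2"), col("Col3"), col("Col4")))
  .groupBy(col("Col1"))
  .pivot("pivot_key")
  .agg(sum(col("Col5")))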
Related
I have a scenario where I need to read a column from a DataFrame using a where condition on another column of the same DataFrame, and then pass those values as an IN condition to select matching rows from a second DataFrame. How can I achieve this with Spark DataFrames?
In SQL it would be:
select distinct(A.date) from table A where A.key in (select B.key from table B where cond='D');
I tried the following:
val Bkey: DataFrame = b_df.filter(col("cond")==="D").select(col("key"))
I have table A's data in the a_df DataFrame and table B's data in the b_df DataFrame. How can I pass the Bkey values to the outer query in Spark?
You can do a semi join:
val result = a_df.join(b_df.filter(col("cond")==="D"), Seq("key"), "left_semi").select("date").distinct()
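Alternatively, if you want to keep the SQL shape from the original query, Spark SQL supports an uncorrelated IN subquery once both DataFrames are registered as temporary views. A small sketch (the view names table_a and table_b are made up here):

// register both DataFrames as temporary views so the SQL can run as-is
a_df.createOrReplaceTempView("table_a")
b_df.createOrReplaceTempView("table_b")

val result = spark.sql(
  """select distinct A.date
    |from table_a A
    |where A.key in (select B.key from table_b B where cond = 'D')""".stripMargin)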
I have the following code, which creates df3. I want to get the minimum value of distance_n and also the entire row containing that minimum value.
// it gives just the min value, but I want the entire row containing that min value
To get the entire row, I registered df3 as a table so I could use spark.sql.
If I do this:
spark.sql("select latitude,longitude,speed,min(distance_n) from table1").show()
// it throws an error
And if I do:
spark.sql("select latitude,longitude,speed,min(distance_nd) from table180").show()
// replacing distance_n with distance_nd, it still throws an error
How can I resolve this to get the entire row corresponding to the min value?
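For reference, one common way to keep the whole row is to sort by the distance column and take the first row, instead of mixing min() with non-aggregated columns. A minimal sketch, with the column name taken from the queries above:

import org.apache.spark.sql.functions.col

// sort ascending on distance_n and keep only the first row;
// the result still contains latitude, longitude, speed, etc.
val minRow = df3.orderBy(col("distance_n").asc).limit(1)
minRow.show()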
Before using a custom UDF in Spark SQL, you have to register it with Spark's SQL context.
e.g.:
spark.sqlContext.udf.register("strLen", (s: String) => s.length())
After the UDF is registered, you can access it in your spark sql like
spark.sql("select strLen(some_col) from some_table")
Reference: https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
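If you prefer to stay in the DataFrame API, the same function can be wrapped with udf() and used directly, without registering it in the SQL context. A small sketch (some_df and some_col are placeholders):

import org.apache.spark.sql.functions.{col, udf}

// same logic as above, but as a DataFrame-side UDF
val strLen = udf((s: String) => s.length)

some_df.withColumn("some_col_len", strLen(col("some_col")))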
I have a Spark DataFrame like the one below:
id|name|age|sub
1 |ravi|21 |[M,J,J,K]
I don't want to explode on the column "sub" as it would create an extra set of rows. I want to generate the unique values from the "sub" column and assign them to a new column sub_unique.
My output should look like:
id|name|age|sub_unique
1 |ravi|21 |[M,J,K]
You can use a UDF:
import org.apache.spark.sql.functions.udf

// keep the distinct values of the array, guarding against null arrays
val distinct = udf((x: Seq[String]) => if (x != null) x.distinct else Seq[String]())

df.withColumn("sub_unique", distinct($"sub"))
I have a DataFrame (df1) which has 50 columns; the first one is cust_id and the rest are features. I also have another DataFrame (df2) which contains only cust_id. I'd like to add one record per customer from df2 to df1, with all the features set to 0. But as the two DataFrames have different schemas, I cannot do a union. What is the best way to do that?
I tried a full outer join, but it generates two cust_id columns and I need one. I should somehow merge these two cust_id columns but don't know how.
You can try to achieve this with a full outer join, like the following:
val result = df1.join(df2, Seq("cust_id"), "full_outer")
However, the features are going to be null instead of 0. If you really need them to be zero, one way to do it would be:
import org.apache.spark.sql.functions.{col, lit}

val features = df1.columns.filter(_ != "cust_id")   // every column except "cust_id"
val newDF = features.foldLeft(df2)(
  (df, colName) => df.withColumn(colName, lit(0))    // add each feature column as 0
)
// union matches columns by position, so align newDF to df1's column order first
df1.union(newDF.select(df1.columns.map(col): _*))
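If you go with the full outer join from the first snippet instead, the nulls it introduces in the feature columns can be replaced afterwards, assuming the features are numeric (a sketch):

// replace the nulls produced by the outer join with 0 in the numeric columns
val zeroFilled = result.na.fill(0)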
I have joined two DataFrames in Spark using the code below.
The DataFrames are expDataFrame and accountList:
val expDetails = expDataFrame.as("fex").join(accountList.as("acctlist"),$"fex.acct_id" === $"acctlist.acct_id", "inner")
Now I am trying to show the acct_id column from both DataFrames.
I have written the code below:
expDetails.select($"fex.acct_id",$"acct_id.acct_id").show
but getting same column name twice as acct_id
I want two unique column name like fex_acct_id, acctlist_acct_id to identify the column from which dataframe.
You simply have to add an alias to the columns using the as or alias methods. This will do the job:
expDetails.select(
  $"fex.acct_id".as("fex_acct_id"),
  $"acctlist.acct_id".as("acctlist_acct_id")
).show