I am trying to pivot a Spark DataFrame on multiple columns using the pivot function, but as soon as I pass more than one column it fails with an overloaded method error.
This is the error I get after adding the third column:
overloaded method value pivot with alternatives: (pivotColumn: String,values:
java.util.List[Any])org.apache.spark.sql.RelationalGroupedDataset
(pivotColumn: String,values:
Seq[Any])org.apache.spark.sql.RelationalGroupedDataset
(pivotColumn: String)org.apache.spark.sql.RelationalGroupedDataset
cannot be applied to (String, String, String)
Here's my work:
val df_new = df.join(df1, df("Col1") <=> df1("col1") && df1("col2") <=> df("col2"))
  .groupBy(df("Col6"))
  .agg(
    sum(df("Col1")).alias("Col1"),
    sum(df("Col2")).alias("Col2"),
    sum(df("Col3")).alias("Col3"),
    sum(df("Col4")).alias("Col4"),
    sum(df("Col5")).alias("Col5")
  )
  .select('Amount, 'Col1, 'Col2, 'Col3, 'Col4, 'Col5)
// Pivot
val pivotdf = df_new.groupBy($"Col1")
  .pivot("Col1", "Col2", "Col3", "col4")
I need to pivot on Col1, Col2, Col3, Col4, and Col5. Please guide me on how I can do that.
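For reference, the error is saying that pivot accepts only a single pivot column (optionally with a list of values). One common workaround is to concatenate the columns you want to pivot on into a single key column and pivot on that. A minimal sketch, with the column roles assumed from the question (adjust to your actual layout):

import org.apache.spark.sql.functions.{col, concat_ws, sum}

// pivot() takes one pivot column, so build a combined key from the columns
// to pivot on (Col2-Col4 here is an assumption), then pivot on that key
val pivotdf = df_new
  .withColumn("pivot_key", concat_ws("_", col("Col2"), col("Col3"), col("Col4")))
  .groupBy(col("Col1"))
  .pivot("pivot_key")
  .agg(sum(col("Col5")))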
Related
I have a scenario where I need to read a column from a DataFrame using a where condition on another column of the same DataFrame, and then pass those values as an IN condition to select matching rows from a second DataFrame. How can I achieve this with Spark DataFrames?
In SQL it would be:
select distinct(A.date) from table A where A.key in (select B.key from table B where cond='D');
I tried the following:
val Bkey: DataFrame = b_df.filter(col("cond")==="D").select(col("key"))
I have table A's data in the a_df DataFrame and table B's data in the b_df DataFrame. How can I pass the Bkey values to the outer query in Spark?
You can do a semi join:
val result = a_df.join(b_df.filter(col("cond")==="D"), Seq("key"), "left_semi").select("date").distinct()
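Alternatively, if you want to keep the SQL shape from the original query, Spark SQL supports an uncorrelated IN subquery once both DataFrames are registered as temporary views. A small sketch (the view names table_a and table_b are made up here):

// register both DataFrames as temporary views so the SQL can run as-is
a_df.createOrReplaceTempView("table_a")
b_df.createOrReplaceTempView("table_b")

val result = spark.sql(
  """select distinct A.date
    |from table_a A
    |where A.key in (select B.key from table_b B where cond = 'D')""".stripMargin)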
I have the following code, which creates df3. I want to get the minimum value of distance_n and also the entire row containing that minimum value.
// it gives just the min value, but I want the entire row containing that min value
To get the entire row, I registered df3 as a table so I could use spark.sql.
If I do this:
spark.sql("select latitude,longitude,speed,min(distance_n) from table1").show()
// it throws an error
And if I do:
spark.sql("select latitude,longitude,speed,min(distance_nd) from table180").show()
// replacing distance_n with distance_nd, it still throws an error
How can I resolve this to get the entire row corresponding to the min value?
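For reference, one common way to keep the whole row is to sort by the distance column and take the first row, instead of mixing min() with non-aggregated columns. A minimal sketch, with the column name taken from the queries above:

import org.apache.spark.sql.functions.col

// sort ascending on distance_n and keep only the first row;
// the result still contains latitude, longitude, speed, etc.
val minRow = df3.orderBy(col("distance_n").asc).limit(1)
minRow.show()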
Before using a custom UDF in Spark SQL, you have to register it with Spark's SQL context.
e.g.:
spark.sqlContext.udf.register("strLen", (s: String) => s.length())
After the UDF is registered, you can access it in your spark sql like
spark.sql("select strLen(some_col) from some_table")
Reference: https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
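If you prefer to stay in the DataFrame API, the same function can be wrapped with udf() and used directly, without registering it in the SQL context. A small sketch (some_df and some_col are placeholders):

import org.apache.spark.sql.functions.{col, udf}

// same logic as above, but as a DataFrame-side UDF
val strLen = udf((s: String) => s.length)

some_df.withColumn("some_col_len", strLen(col("some_col")))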
I have a Spark DataFrame like the one below:
id|name|age|sub
1 |ravi|21 |[M,J,J,K]
I don't want to explode on the column "sub" as it would create an extra set of rows. I want to generate the unique values from the "sub" column and assign them to a new column sub_unique.
My output should look like:
id|name|age|sub_unique
1 |ravi|21 |[M,J,K]
You can use a UDF:
import org.apache.spark.sql.functions.udf

// keep the distinct values of the array, guarding against null arrays
val distinct = udf((x: Seq[String]) => if (x != null) x.distinct else Seq[String]())

df.withColumn("sub_unique", distinct($"sub"))
I have a DataFrame (df1) which has 50 columns; the first one is cust_id and the rest are features. I also have another DataFrame (df2) which contains only cust_id. I'd like to add one record per customer from df2 to df1, with all the features set to 0. But as the two DataFrames have different schemas, I cannot do a union. What is the best way to do that?
I tried a full outer join, but it generates two cust_id columns and I need one. I should somehow merge these two cust_id columns but don't know how.
You can try to achieve this with a full outer join, like the following:
val result = df1.join(df2, Seq("cust_id"), "full_outer")
However, the features are going to be null instead of 0. If you really need them to be zero, one way to do it would be:
import org.apache.spark.sql.functions.{col, lit}

val features = df1.columns.filter(_ != "cust_id")   // every column except "cust_id"
val newDF = features.foldLeft(df2)(
  (df, colName) => df.withColumn(colName, lit(0))    // add each feature column as 0
)
// union matches columns by position, so align newDF to df1's column order first
df1.union(newDF.select(df1.columns.map(col): _*))
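If you go with the full outer join from the first snippet instead, the nulls it introduces in the feature columns can be replaced afterwards, assuming the features are numeric (a sketch):

// replace the nulls produced by the outer join with 0 in the numeric columns
val zeroFilled = result.na.fill(0)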
I have joined two DataFrames in Spark using the code below.
The DataFrames are expDataFrame and accountList:
val expDetails = expDataFrame.as("fex").join(accountList.as("acctlist"),$"fex.acct_id" === $"acctlist.acct_id", "inner")
Now I am trying to show the acct_id column from both DataFrames.
I have written the code below:
expDetails.select($"fex.acct_id",$"acct_id.acct_id").show
but getting same column name twice as acct_id
I want two unique column name like fex_acct_id, acctlist_acct_id to identify the column from which dataframe.
You simply have to add an alias to the columns using the as or alias methods. This will do the job:
expDetails.select(
  $"fex.acct_id".as("fex_acct_id"),
  $"acctlist.acct_id".as("acctlist_acct_id")
).show