I have a DataFrame and I want to get the names of the columns that contain one or more null values.
So far this is what I've done:
df.select([c for c in tbl_columns_list if df.filter(F.col(c).isNull()).count() > 0]).columns
My DataFrame has almost 500 columns, and when I execute that code it becomes incredibly slow for a reason I don't know. Do you have any clue how I can make it work, and how I can optimize it? I need an optimized solution in PySpark, please. Thanks in advance.
I am working in Spark & Scala and have a dataframe with several hundred columns. I would like to sort the dataframe by every column. Is there any way to do this in Scala/Spark?
I have tried:
val sortedDf = actualDF.sort(actualDF.columns)
but .sort does not support Array[String] input.
This question has been asked before (Sort all columns of a dataframe), but there is no Scala answer.
Thank you to @blackbishop for the answer to this:
val dfSortedByAllItsColumns = actualDF.sort(actualDF.columns.map(col): _*)
PySpark Dataframe Groupby and Count Null Values
Referring to the solution linked above, I am trying to apply the same logic, but with groupby("country") to get the null count of another column, and I am getting a "Column is not iterable" error. Can someone help with this?
df7.groupby("country").agg(*(sum(col(c).isNull().cast("int")).alias(c) for c in columns))
covid_india_df.select(
    [
        funcs.count(
            funcs.when(funcs.isnan(clm) | funcs.col(clm).isNull(), clm)
        ).alias(clm)
        for clm in covid_india_df.columns
    ]
).show()
The above approach may help you to get correct results. Check here for a complete example.
I have a Dataframe with two columns as below
Now I want to check, in the Outliers column, whether 'col' == 'city'; if so, set the corresponding 'reco' to Sancak (i.e., assign the Recommendation column's value to it).
How can I achieve this?
Thanks in advance for your help!!
I have two dataframes aaa_01 and aaa_02 in Apache Spark 2.1.0.
I perform an inner join on these two dataframes, selecting a few columns from both to appear in the output.
The join works perfectly fine, but the output dataframe keeps the column names as they were in the input dataframes. This is where I am stuck: I need new column names in my output dataframe instead of the original ones.
Sample Code is given below for reference
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select("a.col1","a.col2","b.col4")
I am getting the output dataframe with column names as "col1, col2, col4". I tried to modify the code as below, but in vain:
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select("a.col1","a.col2","b.col4" as "New_Col")
Any help is appreciated. Thanks in advance.
Edited
I browsed and found similar posts, which are given below, but I do not see an answer to my question.
Updating a dataframe column in spark
Renaming Column names of a Data frame in spark scala
The answers in this post, Spark Dataframe distinguish columns with duplicated name, are not relevant to me: they relate more to PySpark than Scala, and they explain how to rename all the columns of a dataframe, whereas my requirement is to rename only one or a few columns.
You want to rename columns of the dataset; the fact that your dataset comes from a join does not change anything. You can try any example from this answer, for instance:
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner")
.select("a.col1","a.col2","b.col4")
.withColumnRenamed("col4","New_col")
You can use .as to alias the columns:
import sqlContext.implicits._
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select($"a.col1".as("first"),$"a.col2".as("second"),$"b.col4".as("third"))
Or you can use .alias:
import sqlContext.implicits._
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select($"a.col1".alias("first"),$"a.col2".alias("second"),$"b.col4".alias("third"))
If you are looking to rename only one column, you can do:
import sqlContext.implicits._
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select($"a.col1", $"a.col2", $"b.col4".alias("third"))