I have a DataFrame and I want to get the names of the columns that contain one or more null values.
So far this is what I've done:
df.select([c for c in tbl_columns_list if df.filter(F.col(c).isNull()).count() > 0]).columns
My DataFrame has almost 500 columns, and when I execute that code it becomes incredibly slow for a reason I don't know. Do you have any clue how I can make it work, and how I can optimize it? I need an optimized solution in PySpark, please. Thanks in advance.
I am working in Spark & Scala and have a dataframe with several hundred columns. I would like to sort the dataframe by every column. Is there any way to do this in Scala/Spark?
I have tried:
val sortedDf = actualDF.sort(actualDF.columns)
but .sort does not support Array[String] input.
This question has been asked before (Sort all columns of a dataframe), but there is no Scala answer.
Thank you to @blackbishop for the answer to this:
val dfSortedByAllItsColumns = actualDF.sort(actualDF.columns.map(col): _*)
PySpark Dataframe Groupby and Count Null Values
Referring to the solution linked above, I am trying to apply the same logic, but with groupby("country") to get the null count of another column, and I am getting a "Column is not iterable" error. Can someone help with this?
df7.groupby("country").agg(*(sum(col(c).isNull().cast("int")).alias(c) for c in columns))
covid_india_df.select(
    [
        funcs.count(
            funcs.when(funcs.isnan(clm) | funcs.col(clm).isNull(), clm)
        ).alias(clm)
        for clm in covid_india_df.columns
    ]
).show()
The above approach may help you to get correct results. Check here for a complete example.
I have a Dataframe with two columns as below
Now I want to check, in the Outliers column, whether 'col' == 'city'; if so, set the corresponding 'reco' to Sancak (i.e., assign the Recommendation column's value to it).
How can I achieve this?
Thanks in advance for your help!!
I have two dataframes aaa_01 and aaa_02 in Apache Spark 2.1.0.
I perform an inner join on these two dataframes, selecting a few columns from both to appear in the output.
The join works perfectly fine, but the output dataframe keeps the column names as they were in the input dataframes. This is where I am stuck: I need new column names in my output dataframe instead of the original ones.
Sample Code is given below for reference
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select("a.col1","a.col2","b.col4")
I am getting the output dataframe with column names as "col1, col2, col4". I tried to modify the code as below, but in vain:
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select("a.col1","a.col2","b.col4" as "New_Col")
Any help is appreciated. Thanks in advance.
Edited
I browsed and found similar posts, which are given below, but I do not see an answer to my question.
Updating a dataframe column in spark
Renaming Column names of a Data frame in spark scala
The answers in this post, Spark Dataframe distinguish columns with duplicated name, are not relevant to me: they relate more to PySpark than Scala, and they explain how to rename all the columns of a dataframe, whereas my requirement is to rename only one or a few columns.
You want to rename columns of the dataset; the fact that your dataset comes from a join does not change anything. You can try any example from this answer, for instance:
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner")
.select("a.col1","a.col2","b.col4")
.withColumnRenamed("col4","New_col")
You can use .as to alias the columns:
import sqlContext.implicits._
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select($"a.col1".as("first"),$"a.col2".as("second"),$"b.col4".as("third"))
Or you can use .alias:
import sqlContext.implicits._
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select($"a.col1".alias("first"),$"a.col2".alias("second"),$"b.col4".alias("third"))
If you are looking to rename only one column, you can do:
import sqlContext.implicits._
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select($"a.col1", $"a.col2", $"b.col4".alias("third"))