Is there a way to remove the columns of a Spark DataFrame that contain only null values?
(I am using scala and Spark 1.6.2)
At the moment I am doing this:
var validCols: List[String] = List()
for (col <- df_filtered.columns) {
  val count = df_filtered
    .select(col)
    .distinct
    .count
  println(col, count)
  if (count >= 2) {
    validCols ++= List(col)
  }
}
to build the list of columns containing at least two distinct values, and then use it in a select().
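For reference, the follow-up select looks roughly like this (a sketch, assuming validCols ends up non-empty):
// keep only the columns that have at least two distinct values
val df_valid = df_filtered.select(validCols.head, validCols.tail: _*)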
Thank you!
I had the same problem and I came up with a similar solution in Java. In my opinion there is no other way of doing it at the moment.
for (String column : df.columns()) {
  long count = df.select(column).distinct().count();
  if (count == 1 && df.select(column).first().isNullAt(0)) {
    df = df.drop(column);
  }
}
I'm dropping all columns containing exactly one distinct value whose first value is null. This way I can be sure that I don't drop columns where all values are the same but not null.
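A rough Scala equivalent of this loop (an untested sketch, using the same drop rule) would be:
// drop every column with exactly one distinct value whose first value is null
val cleaned = df.columns.foldLeft(df) { (acc, column) =>
  if (acc.select(column).distinct().count() == 1 && acc.select(column).first().isNullAt(0))
    acc.drop(column)
  else
    acc
}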
Here's a Scala example to remove null columns that only queries the data once (faster):
def removeNullColumns(df: DataFrame): DataFrame = {
  var dfNoNulls = df
  val exprs = df.columns.map((_ -> "count")).toMap
  val cnts = df.agg(exprs).first
  for (c <- df.columns) {
    val uses = cnts.getAs[Long]("count(" + c + ")")
    if (uses == 0) {
      dfNoNulls = dfNoNulls.drop(c)
    }
  }
  dfNoNulls
}
A more idiomatic version of #swdev's answer:
private def removeNullColumns(df: DataFrame): DataFrame = {
  val exprs = df.columns.map((_ -> "count")).toMap
  val cnts = df.agg(exprs).first
  df.columns
    .filter(c => cnts.getAs[Long]("count(" + c + ")") == 0)
    .foldLeft(df)((df, col) => df.drop(col))
}
If the dataframe is of reasonable size, I write it as JSON and then reload it. The inferred schema will ignore the all-null columns (the JSON writer drops null fields), and you'd have a lighter dataframe.
Scala snippet:
originalDataFrame.write.json(tempJsonPath)
val lightDataFrame = spark.read.json(tempJsonPath)
Here's #timo-strotmann's solution in pySpark syntax:
for column in df.columns:
    count = df.select(column).distinct().count()
    if count == 1 and df.first()[column] is None:
        df = df.drop(column)
My current DataFrame looks like this:
{"id":"1","inputs":{"values":{"0.2":[1,1],"0.4":[1,1],"0.6":[1,1]}},"id1":[1,2]}
I want to transform this DataFrame into the one below:
{"id":"1", "v20":[1,1],"v40":[1,1],"v60":[1,1],"id1":[1,2]}
This means that each key inside 'values' (0.2, 0.4 and 0.6) will be multiplied by 100, prepended with the letter 'v', and extracted into a separate column.
How would the code look in order to achieve this? I have tried withColumn but couldn't achieve it.
Try the code below; please see the inline comments for the explanation.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

object DynamicCol {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.read.json("src/main/resources/dyamicCol.json") // Load the JSON file
    val dfTemp = df.select(col("inputs.values").as("values")) // Temp DataFrame for fetching the nested values
    val index = dfTemp
      .schema.fieldIndex("values")
    val propSchema = dfTemp.schema(index).dataType.asInstanceOf[StructType]
    val dfFinal = propSchema.fields.foldLeft(df)((df, field) => { // Join the DataFrame with the list of nested columns
      val colNameInt = (field.name.toDouble * 100).toInt
      val colName = s"v$colNameInt"
      df.withColumn(colName, col("inputs.values.`" + field.name + "`")) // Add the nested column mappings
    }).drop("inputs") // Drop the extra column
    dfFinal.write.mode(SaveMode.Overwrite).json("src/main/resources/dyamicColOut.json") // Output the JSON file
  }

}
I would split the logic for the column-name change into two parts: one for names that are a numeric value, and one for names that don't change.
def stringDecimalToVNumber(colName: String): String =
  "v" + (colName.toFloat * 100).toInt.toString
and form a single function that transforms the name according to the case:
val floatRegex = """(\d+\.?\d*)""".r

def transformColumnName(colName: String): String = colName match {
  case floatRegex(v) => stringDecimalToVNumber(v) // it's a float, transform it
  case x => x // keep it
}
Now that we have the function to transform the column names, let's pick up the schema dynamically.
val flattenDF = df.select("id", "inputs.values.*")

val finalDF = flattenDF
  .schema.names
  .foldLeft(flattenDF)((dfacum, x) => {
    val newName = transformColumnName(x)
    if (newName == x)
      dfacum // the name didn't need to be changed
    else
      dfacum.withColumnRenamed(x, newName)
  })
This will dynamically rename all the columns inside inputs.values to their new names, and put them next to id.
I want to join two dataframes with a full outer join, and I am trying to add a new column in the joined result set that tells me the matching records, the unmatched records from the left dataframe alone, and the unmatched records from the right dataframe alone.
Here is my spark code:
val creditLoc = "/data/accounts/credits/year=2016/month=06/day=02"
val debitLoc = "/data/accounts/debits/year=2016/month=06/day=02"
val creditDF = sqlContext.read.avro(creditLoc)
val debitDF = sqlContext.read.avro(debitLoc)
val credit = creditDF.withColumnRenamed("account_id","credit_account_id").as("credit")
val debit = debitDF.withColumnRenamed("account_id","debit_account_id").as("debit")
val fullOuterDF = credit.join(debit,credit("credit_account_id") === debit("debit_account_id"),"full_outer")
val CREDIT_DEBIT_CONSOLIDATE_SCHEMA = List(
  ("credit.credit_account_id", "string"),
  ("credit.channel_name", "string"),
  ("credit.service_key", "string"),
  ("credit.trans_id", "string"),
  ("credit.trans_dt", "string"),
  ("credit.trans_amount", "string"),
  ("debit.debit_account_id", "string"),
  ("debit.icf_number", "string"),
  ("debit.debt_amount", "string")
)
val columnNamesList = CREDIT_DEBIT_CONSOLIDATE_SCHEMA.map(elem => col(elem._1)).seq
val df = fullOuterDF.select(columnNamesList:_*)
val caseDF = df.withColumn("matching_type",
  when(df("credit_account_id") === df("debit_account_id"), "MATCHING_CREDIT_DEBIT").otherwise(
    when(df("debit_account_id").isNull, "UNMATCHED_CREDIT").otherwise(
      when(df("credit_account_id").isNull, "UNMATCHED_DEBIT").otherwise("INVALID_MATCHING_TYPE")
    )
  )
)
As of now I have applied the logic for "matching_type" inside the when clause itself, but now I want to write the logic for "matching_type" inside a UDF.
If I write it like the above, the code works.
The UDFs below accept a single column as a parameter; how do I create a UDF that accepts multiple columns and returns a boolean based on conditions inside that UDF?
val isUnMatchedCREDIT = udf[Boolean, String](credit_account_id => {
  credit_account_id == null
})

val isUnMatchedDEBIT = udf[Boolean, String](debit_account_id => {
  debit_account_id == null
})

val caseDF = df.withColumn("matching_type",
  when(df("credit_account_id") === df("debit_account_id"), "MATCHING_CREDIT_DEBIT").otherwise(
    when(isUnMatchedCREDIT(df("credit_account_id")), "UNMATCHED_CREDIT").otherwise(
      when(isUnMatchedDEBIT(df("debit_account_id")), "UNMATCHED_DEBIT").otherwise("INVALID_MATCHING_TYPE")
    )
  )
)
Basically I want to create another UDF, isMatchedCREDITDEBIT(), that accepts two columns, credit_account_id and debit_account_id, and returns true if both values are equal, else false. In simple words, I want to create a UDF for the below logic:
when(df("credit_account_id") === df("debit_account_id"),"MATCHING_CREDIT_DEBIT")
I have tried this but it is throwing a compile-time error:
val isMatchedCREDITDEBIT()= udf[Boolean, String,String](credit_account_id => {
credit_account_id == debit_account_id
})
Could someone help me on this?
You can create a UDF that takes two columns and performs your logic like this:
val isMatchedCREDITDEBIT = udf((credit_account_id: String, debit_account_id: String) => {
  credit_account_id == debit_account_id
})
which can be called in the when clause:
when(isMatchedCREDITDEBIT(df("credit_account_id"), df("debit_account_id")), "MATCHING_CREDIT_DEBIT")
However, it would be easier to create a single UDF for all the logic you are performing on the two columns. The UDF below takes both columns as input and returns the string you want, instead of a boolean.
val isMatchedCREDITDEBIT = udf((credit_account_id: String, debit_account_id: String) => {
  if (credit_account_id == null) {
    "UNMATCHED_CREDIT"
  } else if (debit_account_id == null) {
    "UNMATCHED_DEBIT"
  } else if (credit_account_id == debit_account_id) {
    "MATCHING_CREDIT_DEBIT"
  } else {
    "INVALID_MATCHING_TYPE"
  }
})

val caseDF = df.withColumn("matching_type",
  isMatchedCREDITDEBIT(df("credit_account_id"), df("debit_account_id")))
I have a table test with two columns a: String and b: String. I'm trying to apply some function to the values in a and b, say if a = 23 and a - b < 5 (the real logic might be more complicated than this), then create another table with a column c as "yes".
I tried to create a case class num as below, convert the table to that class, and apply the function to it. How should I do this, or is this even doable? Many thanks!
case class num(a: String, b: String) {
  def howmany = {
    // how should I put the logic here?
  }
}
sqlContext.table("test").as[num].//how can I then apply function `howmany`here?
Assuming you have a DataFrame of num, you can put your logic in a UDF and use the DataFrame withColumn API to call your UDF and add the new column. You can find more details of this method here.
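For example, a minimal sketch of that approach (the column name "c" and the condition come from the question; the UDF name is made up for illustration):
import org.apache.spark.sql.functions.{col, udf}

// hypothetical UDF wrapping the "a = 23 and a - b < 5" logic from the question
val howmany = udf((a: String, b: String) =>
  if (a.toInt == 23 && a.toInt - b.toInt < 5) "yes" else "no")

val withC = sqlContext.table("test").withColumn("c", howmany(col("a"), col("b")))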
I think this is what you need.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ParquetAppendMode")
  .getOrCreate()

import spark.implicits._

val data = spark.sparkContext.parallelize(Seq(
  ("123", "125"),
  ("23", "25"),
  ("5", "9")
)).toDF("a", "b") // create a dataframe with dummy data

val logic = udf((a: String, b: String) => {
  if (a.toInt == 23 && (a.toInt - b.toInt) < 5) "yes" else "no"
})

// add a column by applying the logic
val result = data.withColumn("newCol", logic(data("a"), data("b")))
result.show()
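With the dummy data above, result.show() should print something like:
+---+---+------+
|  a|  b|newCol|
+---+---+------+
|123|125|    no|
| 23| 25|   yes|
|  5|  9|    no|
+---+---+------+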
You can use Dataset's map:
sqlContext.table("test").as[num]
.map(num => if(a == 23 && a-b < 5) (num, "yes") else (num, "no"))
Or you could just use a filter followed by a select:
sqlContext.table("test").where($"a" == 23 && ($"a"-$"b") < 5).select($"*", "yes".as("newCol"))
I have a file and I want to give it to an MLlib algorithm. So I am following the example and doing something like:
val data = sc.textFile(my_file)
  .map { line =>
    val parts = line.split(",")
    Vectors.dense(parts.slice(1, parts.length).map(x => x.toDouble).toArray)
  }
This works, except that sometimes I have a missing feature. That is, sometimes one column of some row does not have any data, and I want to throw away rows like this.
So I want to do something like this: map{line => if(containsMissing(line)) { skipLine } else { ... // same as before }}
How can I do this skipLine action?
You can use the filter function to filter out such lines:
val data = sc.textFile(my_file)
  .filter(_.split(",").length == cols)
  .map { line =>
    // your code
  }
Assuming the variable cols holds the number of columns in a valid row.
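If you don't know the column count up front, one way to derive it from the first line (a sketch that reads the file one extra time) is:
val cols = sc.textFile(my_file).first().split(",").length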
You can use flatMap, Some and None for this:
def missingFeatures(stuff: Array[String]): Boolean = ??? // Determine if a feature is missing
val data = sc.textFile(my_file)
  .flatMap { line =>
    val parts = line.split(",")
    if (missingFeatures(parts)) None
    else Some(Vectors.dense(parts.slice(1, parts.length).map(x => x.toDouble).toArray))
  }
This way you avoid mapping over the RDD more than once.
Java code to skip empty lines / header from Spark RDD:
First the imports:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
Now, the filter keeps rows that have all 17 columns and are not the header row (which starts with VendorID).
Function<String, Boolean> isInvalid = row -> (row.split(",").length == 17 && !(row.startsWith("VendorID")));
JavaRDD<String> taxis = sc.textFile("datasets/trip_yellow_taxi.data")
.filter(isInvalid);
I am looking for a better approach to convert a DataFrame to an RDD. Right now I am collecting the DataFrame to a local collection and looping over that collection to prepare the RDD. But we know looping is not good practice.
val randomProduct = scala.collection.mutable.MutableList[Product]()
val results = hiveContext.sql("select id,value from details");
val collection = results.collect();
var i = 0;
results.collect.foreach(t => {
  val product = new Product(collection(i)(0).asInstanceOf[Long], collection(i)(1).asInstanceOf[String]);
  i = i + 1;
  randomProduct += product
})
randomProduct
//returns RDD[Product]
Please suggest how I can make this a standard and stable approach that works for huge amounts of data.
val results = hiveContext.sql("select id,value from details");
results.rdd.map( row => new Product( row.getLong(0), row.getString(1) ) ) // RDD[Product]
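Unlike the collect-and-loop version, this runs the mapping on the executors and nothing is pulled back to the driver, so it scales to large data. If you prefer to look the columns up by name instead of position (a sketch, assuming the same Product class), it might look like:
val randomProduct = results.rdd.map(row => new Product(row.getAs[Long]("id"), row.getAs[String]("value"))) // RDD[Product]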