I have a dataframe with many double (and/or float) columns, which contain NaNs. I want to replace all NaNs (i.e. Float.NaN and Double.NaN) with null.
I can do this for a single column x, e.g.:
val newDf = df.withColumn("x", when($"x".isNaN,lit(null)).otherwise($"x"))
This works, but I'd like to do it for all columns at once. I recently discovered DataFrameNaFunctions (df.na) and its fill method, which sounds like exactly what I need. Unfortunately I failed to make it do the above. fill should replace all NaNs and nulls with a given value, so I do:
df.na.fill(null.asInstanceOf[java.lang.Double]).show
which gives me a NullPointerException.
There is also a promising replace method, but I can't even compile the code:
df.na.replace("x", Map(java.lang.Double.NaN -> null.asInstanceOf[java.lang.Double])).show
Strangely, this gives me:
Error:(57, 34) type mismatch;
found : scala.collection.immutable.Map[scala.Double,java.lang.Double]
required: Map[Any,Any]
Note: Double <: Any, but trait Map is invariant in type A.
You may wish to investigate a wildcard type such as `_ <: Any`. (SLS 3.2.10)
df.na.replace("x", Map(java.lang.Double.NaN -> null.asInstanceOf[java.lang.Double])).show
To replace all NaN(s) with null in Spark you just have to create a Map of replacement values for every column, like this:
val map = df.columns.map((_, "null")).toMap
Then you can use fill to replace NaN(s) with null values:
df.na.fill(map)
For Example:
scala> val df = List((Float.NaN, Double.NaN), (1f, 0d)).toDF("x", "y")
df: org.apache.spark.sql.DataFrame = [x: float, y: double]
scala> df.show
+---+---+
| x| y|
+---+---+
|NaN|NaN|
|1.0|0.0|
+---+---+
scala> val map = df.columns.map((_, "null")).toMap
map: scala.collection.immutable.Map[String,String] = Map(x -> null, y -> null)
scala> df.na.fill(map).printSchema
root
|-- x: float (nullable = true)
|-- y: double (nullable = true)
scala> df.na.fill(map).show
+----+----+
| x| y|
+----+----+
|null|null|
| 1.0| 0.0|
+----+----+
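If you prefer the when/otherwise approach from the question, here is a minimal sketch (assumptions: Spark 2.x, and that NaN only needs handling in float/double columns) that applies it to every such column at once instead of a single hard-coded column:
import org.apache.spark.sql.functions.{col, isnan, lit, when}
import org.apache.spark.sql.types.{DoubleType, FloatType}

// Rebuild the projection, replacing NaN with null only in float/double
// columns; all other columns are passed through untouched.
val floatingCols = df.schema.fields
  .filter(f => f.dataType == FloatType || f.dataType == DoubleType)
  .map(_.name)
  .toSet

val newDf = df.select(df.columns.map { c =>
  if (floatingCols.contains(c))
    when(isnan(col(c)), lit(null)).otherwise(col(c)).alias(c)
  else
    col(c)
}: _*)
This avoids df.na entirely and keeps the original column names and order.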
I hope this helps!
To replace all NaN values with a given value in a Spark DataFrame using the PySpark API, you can do the following:
col_list = ['column1', 'column2']
df = df.na.fill(replace_by_value, col_list)
I am looking for the explode function, or its equivalent, in plain Scala rather than Spark.
Using the explode function in Spark, I was able to flatten a row with multiple elements into multiple rows, as below.
scala> import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.functions.explode
scala> val test = spark.read.json(spark.sparkContext.parallelize(Seq("""{"a":1,"b":[2,3]}""")))
scala> test.schema
res1: org.apache.spark.sql.types.StructType = StructType(StructField(a,LongType,true), StructField(b,ArrayType(LongType,true),true))
scala> test.show
+---+------+
| a| b|
+---+------+
| 1|[2, 3]|
+---+------+
scala> val flat = test.withColumn("b",explode($"b"))
flat: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint]
scala> flat.show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 1| 3|
+---+---+
Is there an equivalent of explode in plain Scala without using Spark? If not, is there any way I can implement it myself?
A simple flatMap should help you in this case. I don't know the exact data structure you would like to work with in Scala, but let's take a slightly artificial example:
val l: List[(Int, List[Int])] = List(1 -> List(2, 3))
val result: List[(Int, Int)] = l.flatMap {
  case (a, b) => b.map(i => a -> i)
}
println(result)
which will produce the following result:
List((1,2), (1,3))
UPDATE
As suggested in the comment section by @jwvh, the same result may be achieved with a for-comprehension, hiding the explicit flatMap and map invocations:
val result2: List[(Int, Int)] = for((a, bList) <- l; b <- bList) yield a -> b
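If you want this as a reusable piece, here is a small generic helper (purely illustrative, not from any library) that mimics Spark's explode for a key paired with a collection of values:
// A generic, Spark-free "explode": each element of the collection
// becomes its own (key, element) pair.
def explode[A, B](row: (A, Seq[B])): Seq[(A, B)] =
  row._2.map(b => row._1 -> b)

// Reusing the list l from above:
val exploded = l.flatMap(explode(_))
// exploded: List[(Int, Int)] = List((1,2), (1,3))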
Hope this helps!
I am getting a ClassCastException when applying a kNN classifier:
val df = training.map { r =>
  (Vectors.dense(r.getAs[Array[Double]]("features")), r.getAs[Int]("id"))
}.toDF("features", "id")
The error that appears is:
scala.collection.mutable.WrappedArray$ofRef cannot be cast to [D
I tried Seq and WrappedArray, but it doesn't work.
I am going to assume the following schema for training:
id:Integer
features: Array[Double]
Try:
val df = training.map(r => (Vectors.dense(r.getAs[Seq[Double]]("features").toArray),r.getAs[Integer]("id"))).toDF("features","id")
Datasets internally store Array objects as WrappedArray; a quick intro can be found here:
Array vs Wrapped array
So, you should "extract" your array of doubles by casting it to Seq[Double] instead of Array[Double]. However, the method dense needs Array[Double]. So, convert the Seq[Double] to Array[Double] using the toArray method.
val training = List((Seq(0.0,0.0),2),(Seq(1.0,1.0),5)).toDF("features","id")
training.show
+----------+---+
| features| id|
+----------+---+
|[0.0, 0.0]| 2|
|[1.0, 1.0]| 5|
+----------+---+
training: org.apache.spark.sql.DataFrame = [features: array<double>, id: int]
val df = training.map(r => (Vectors.dense(r.getAs[Seq[Double]]("features").toArray), r.getAs[Integer]("id"))).toDF("features", "id")
df.show
+---------+---+
| features| id|
+---------+---+
|[0.0,0.0]| 2|
|[1.0,1.0]| 5|
+---------+---+
df: org.apache.spark.sql.DataFrame = [features: vector, id: int]
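As an alternative sketch, assuming Spark 2.x and org.apache.spark.ml.linalg.Vectors, you can also stay in the DataFrame API and do the conversion with a small UDF instead of map:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

// Spark hands an array<double> column to a UDF as Seq[Double], so no
// Row.getAs casting is needed here.
val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))

val vecDf = training.withColumn("features", toVector(col("features")))
// vecDf: org.apache.spark.sql.DataFrame = [features: vector, id: int]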
Hope this helps.
This question already has an answer here:
How to get Array[Seq[String]] from DataFrame?
I have a DataFrame and I want to convert it into a sequence of sequences and vice versa.
Now the thing is, I want to do it dynamically, and write something which runs for DataFrame with any number/type of columns.
In summary, these are the questions:
How to convert Seq[Seq[String]] to a DataFrame?
How to convert a DataFrame to Seq[Seq[String]]?
How to build the DataFrame as in 1, but also make it infer the schema and decide column types by itself?
UPDATE 1
This is not a duplicate of that question, because the solution provided there is not dynamic: it works for two columns, or for however many columns are hardcoded. I am trying to find a dynamic solution.
This is how you can dynamically create a dataframe from Seq[Seq[String]]:
scala> val seqOfSeq = Seq(Seq("a","b", "c"),Seq("3","4", "5"))
seqOfSeq: Seq[Seq[String]] = List(List(a, b, c), List(3, 4, 5))
scala> val lengthOfRow = seqOfSeq(0).size
lengthOfRow: Int = 3
scala> val tempDf = sc.parallelize(seqOfSeq).toDF
tempDf: org.apache.spark.sql.DataFrame = [value: array<string>]
scala> val requiredDf = tempDf.select((0 until lengthOfRow).map(i => col("value")(i).alias(s"col$i")): _*)
requiredDf: org.apache.spark.sql.DataFrame = [col0: string, col1: string ... 1 more field]
scala> requiredDf.show
+----+----+----+
|col0|col1|col2|
+----+----+----+
| a| b| c|
| 3| 4| 5|
+----+----+----+
How to convert the DataFrame back to Seq[Seq[String]]:
val newSeqOfSeq = requiredDf.collect().map(row => row.toSeq.map(_.toString).toSeq).toSeq
To use custom column names:
scala> val myCols = Seq("myColA", "myColB", "myColC")
myCols: Seq[String] = List(myColA, myColB, myColC)
scala> val requiredDf = tempDf.select((0 until lengthOfRow).map(i => col("value")(i).alias( myCols(i) )): _*)
requiredDf: org.apache.spark.sql.DataFrame = [myColA: string, myColB: string ... 1 more field]
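For the third question (letting Spark infer column types while building the DataFrame from Seq[Seq[String]]), one possible sketch, assuming Spark 2.2+ (where spark.read.csv accepts a Dataset[String]) and values that contain no commas, is to render each inner Seq as a CSV line and let the CSV reader infer the schema:
import spark.implicits._

// Render each row as a CSV line and let Spark's CSV reader infer types.
// For the seqOfSeq above every column stays string (values mix letters
// and digits); a purely numeric column would come back as int/double.
val csvLines = seqOfSeq.map(_.mkString(",")).toDS()
val inferredDf = spark.read.option("inferSchema", "true").csv(csvLines)

inferredDf.printSchema()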
I have two Spark DataFrames:
df1 with 80 columns
CO01...CO80
+----+----+
|CO01|CO02|
+----+----+
|2.06|0.56|
|1.96|0.72|
|1.70|0.87|
|1.90|0.64|
+----+----+
and df2 with 80 columns
avg(CO01)...avg(CO80)
which holds the mean of each column:
+------------------+------------------+
| avg(CO01)| avg(CO02)|
+------------------+------------------+
|2.6185106382978716|1.0080985915492937|
+------------------+------------------+
How can I subtract df2 from df1 for the corresponding values?
I'm looking for a solution that does not require listing all the columns.
P.S. In pandas it could simply be done by:
df2=df1-df1.mean()
Here is what you can do:
scala> val df = spark.sparkContext.parallelize(List(
| (2.06,0.56),
| (1.96,0.72),
| (1.70,0.87),
| (1.90,0.64))).toDF("c1","c2")
df: org.apache.spark.sql.DataFrame = [c1: double, c2: double]
scala>
scala> def subMean(mean: Double) = udf[Double, Double]((value: Double) => value - mean)
subMean: (mean: Double)org.apache.spark.sql.expressions.UserDefinedFunction
scala>
scala> val result = df.columns.foldLeft(df)( (df, col) =>
| { val avg = df.select(mean(col)).first().getAs[Double](0);
| df.withColumn(col, subMean(avg)(df(col)))
| })
result: org.apache.spark.sql.DataFrame = [c1: double, c2: double]
scala>
scala> result.show(10, false)
+---------------------+---------------------+
|c1 |c2 |
+---------------------+---------------------+
|0.15500000000000025 |-0.13749999999999996 |
|0.05500000000000016 |0.022499999999999964 |
|-0.20499999999999985 |0.1725 |
|-0.004999999999999893|-0.057499999999999996|
+---------------------+---------------------+
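A shorter sketch of the same idea without a UDF, again assuming all columns are numeric: collect the per-column means once, then subtract them as plain column expressions:
import org.apache.spark.sql.functions.{avg, col}

// One pass to compute all the means, then one projection that subtracts them.
val means = df.select(df.columns.map(c => avg(col(c)).alias(c)): _*).first()

val centered = df.select(df.columns.zipWithIndex.map { case (c, i) =>
  (col(c) - means.getDouble(i)).alias(c)
}: _*)

centered.show(10, false)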
Hope this helps!
Please note that this will work for any number of columns, as long as all the columns in the dataframe are of numeric type.
Is it possible to do that? All the data in my dataframe (~1000 cols) are Doubles, and I'm wondering whether I could turn a row of data into a list of Doubles.
You can use the toSeq method on the Row and then convert the type from Seq[Any] to Seq[Double] (if you are sure the data types of all the columns are Double):
val df = Seq((1.0,2.0),(2.1,2.2)).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: double, B: double]
df.show
+---+---+
| A| B|
+---+---+
|1.0|2.0|
|2.1|2.2|
+---+---+
df.first.toSeq.asInstanceOf[Seq[Double]]
// res1: Seq[Double] = WrappedArray(1.0, 2.0)
In case you have String type columns, use toSeq and then use map with pattern matching to convert the String to Double:
val df = Seq((1.0,"2.0"),(2.1,"2.2")).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: double, B: string]
df.first.toSeq.map {
  case x: String => x.toDouble
  case x: Double => x
}
// res3: Seq[Double] = ArrayBuffer(1.0, 2.0)
If you have a dataframe of doubles which you want to convert into a list of doubles, just convert the dataframe into an RDD, which gives you an RDD[Row]; you can then convert each Row to a list:
dataframe.rdd.map(_.toSeq.toList)
You will get a list of values for each row.
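Strictly speaking, _.toSeq.toList produces a List[Any]. A sketch that gives you a statically typed result, assuming every column really is a Double:
// Cast each element so the driver-side result is List[List[Double]]
// rather than a list of Any.
val doubles: List[List[Double]] =
  dataframe.rdd
    .map(_.toSeq.map(_.asInstanceOf[Double]).toList)
    .collect()
    .toList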