spark/scala drop row with nan in any column - scala

I am using Zeppelin and df is a Spark DataFrame. I am trying to filter out rows that contain NaN in any column, but for some reason nothing gets filtered out.
val df = df_labeled("df_Germany")
df: org.apache.spark.sql.DataFrame = [Kik: string, Lak: string ... 15 more fields]
df.count()
res66: Long = 455
df.na.drop().count
res66: Long = 455
How do I filter NaNs all at once?

How do I filter NaNs all at once?
Generally, the following should work:
df.na.drop
But there is an alternative: use the .isNaN function on each column that can contain NaN. Since NaN is only possible for Float and Double values, we need to get the names of the columns whose dataType is DoubleType or FloatType and build the filter from those:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DoubleType, FloatType}

val nan_columns = df.schema.filter(x => x.dataType == DoubleType || x.dataType == FloatType).map(_.name)
df.filter(!nan_columns.map(col(_).isNaN).reduce(_ or _))
or you can use the built-in isnan function as:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DoubleType, FloatType}

val nan_columns = df.schema.filter(x => x.dataType == DoubleType || x.dataType == FloatType).map(_.name)
df.filter(!nan_columns.map(x => isnan(col(x))).reduce(_ or _))
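For illustration, here is a minimal, self-contained sketch of that filter on a toy DataFrame (the data and column names are made up):
import org.apache.spark.sql.functions.{col, isnan}
import org.apache.spark.sql.types.{DoubleType, FloatType}
import spark.implicits._   // already in scope in Zeppelin / spark-shell

// toy data: the second row contains a NaN
val toy = Seq(("a", 1.0), ("b", Double.NaN)).toDF("Kik", "value")

val nan_columns = toy.schema
  .filter(f => f.dataType == DoubleType || f.dataType == FloatType)
  .map(_.name)

// keep only the rows where none of those columns is NaN
toy.filter(!nan_columns.map(c => isnan(col(c))).reduce(_ or _)).show
// +---+-----+
// |Kik|value|
// +---+-----+
// |  a|  1.0|
// +---+-----+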

Assuming df is your DataFrame: if you want to drop all rows that have NaN values in any column, you can use
df.na.drop
If you want to fill all the NaN values with some value instead, you can use
df.na.fill(your_value)
To restrict the drop to specific columns:
val cols = Seq("col1","col2")
df.na.drop(cols)
But if you want to do this column-wise, you can do
df.filter(!$"col_name".isNaN)
Or
df.filter(!isnan($"your_column"))
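A short sketch tying these variants together, on a hypothetical two-column DataFrame:
import org.apache.spark.sql.functions.isnan
import spark.implicits._   // already in scope in Zeppelin / spark-shell

val toy = Seq((1.0, 2.0), (Double.NaN, 3.0), (4.0, Double.NaN)).toDF("col1", "col2")

toy.na.drop().show                 // drops both rows containing a NaN
toy.na.drop(Seq("col1")).show      // drops only the row whose col1 is NaN
toy.na.fill(0.0).show              // replaces every null/NaN in numeric columns with 0.0
toy.filter(!isnan($"col2")).show   // column-wise: drops the row whose col2 is NaN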

Related

spark withColumn value generation from all column values

I want to add a column from all existing column values in the same row.
For example,
col1  col2  ... coln  | col_new
True  False ... False | "col1-..."
False True  ... True  | "col2-...-coln"
That is, when a value is True, append its column name, joining the names with a "-" separator, and keep doing this up to the last column. We don't know in advance how many columns there will be.
How can I achieve this with withColumn() in Spark? (Scala)
If the columns are all of BooleanType then you can write a udf function to build the new column as below
import org.apache.spark.sql.functions._
val columnNames = df.columns
def concatColNames = udf((array: collection.mutable.WrappedArray[Boolean]) => array.zip(columnNames).filter(x => x._1 == true).map(_._2).mkString("-"))
df.withColumn("col_new", concatColNames(array(df.columns.map(col): _*))).show(false)
If the columns are all of StringType then you just need to modify the udf function as below
def concatColNames = udf((array: collection.mutable.WrappedArray[String]) => array.zip(columnNames).filter(x => x._1 == "True").map(_._2).mkString("-"))
You should get what you require
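For example, on a small all-boolean DataFrame (the data here is made up to match the question's shape):
import org.apache.spark.sql.functions.{array, col, udf}
import spark.implicits._   // already in scope in Zeppelin / spark-shell

val toy = Seq(
  (true, false, false),
  (false, true, true)
).toDF("col1", "col2", "col3")

val columnNames = toy.columns

// join the names of the columns whose value is true with "-"
def concatColNames = udf((values: Seq[Boolean]) =>
  values.zip(columnNames).filter(_._1).map(_._2).mkString("-"))

toy.withColumn("col_new", concatColNames(array(toy.columns.map(col): _*))).show(false)
// +-----+-----+-----+---------+
// |col1 |col2 |col3 |col_new  |
// +-----+-----+-----+---------+
// |true |false|false|col1     |
// |false|true |true |col2-col3|
// +-----+-----+-----+---------+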

Create new DataFrame with new rows depending on the value of a column - Spark Scala

I have a DataFrame with the following data:
num_cta      | n_lines
110000000000 | 2
110100000000 | 3
110200000000 | 1
With that information, I need to create a new DF with a different number of rows for each record, depending on the value in the n_lines column.
For example, for the first row of my DF (110000000000), the value of the n_lines column is 2, so the result would have to be something like the following:
num_cta
110000000000
110000000000
For the whole example DataFrame shown above, the result would have to be something like this:
num_cta
110000000000
110000000000
110100000000
110100000000
110100000000
110200000000
Is there a way to do that, i.e. multiply a row n times depending on the value of a column?
Regards.
One approach would be to expand n_lines into an array with a UDF and explode it:
import org.apache.spark.sql.functions.{explode, udf}

val df = Seq(
  ("110000000000", 2),
  ("110100000000", 3),
  ("110200000000", 1)
).toDF("num_cta", "n_lines")

// build an array of length n_lines so that explode yields one row per element
def fillArr = udf((n: Int) => Array.fill(n)(1))

val df2 = df.withColumn("arr", fillArr($"n_lines")).
  withColumn("a", explode($"arr")).
  select($"num_cta")

df2.show
+------------+
| num_cta|
+------------+
|110000000000|
|110000000000|
|110100000000|
|110100000000|
|110100000000|
|110200000000|
+------------+
There is no off-the-shelf way of doing this. However, you can iterate over the DataFrame and return a list of num_cta values whose length equals the corresponding n_lines.
Something like
import spark.implicits._

case class Output(num_cta: String)              // output dataframe schema (class name is a placeholder)
case class Input(num_cta: String, n_lines: Int) // input dataframe 'df' schema (class name is a placeholder)

// one output row per repetition of num_cta
val result = df.as[Input].flatMap(x =>
  List.fill(x.n_lines)(Output(x.num_cta))
).toDF
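Neither of the answers above mentions it, but if you are on Spark 2.4 or later, the built-in array_repeat function can replace the custom UDF entirely; a sketch under that assumption:
import org.apache.spark.sql.functions.{array_repeat, explode}
import spark.implicits._   // already in scope in Zeppelin / spark-shell

val df = Seq(
  ("110000000000", 2),
  ("110100000000", 3),
  ("110200000000", 1)
).toDF("num_cta", "n_lines")

// repeat num_cta n_lines times, then explode into one row per repetition
val result = df.select(explode(array_repeat($"num_cta", $"n_lines")).as("num_cta"))
result.show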

Spark Dataframe select based on column index

How do I select all the columns of a DataFrame that are at certain indexes in Scala?
For example, if a DataFrame has 100 columns and I want to extract only columns (10,12,13,14,15), how do I do that?
The following selects all columns from DataFrame df whose names are listed in the Array colNames:
df = df.select(colNames.head,colNames.tail: _*)
If there is a similar array colNos which has
colNos = Array(10,20,25,45)
how do I transform the above df.select to fetch only the columns at those specific indexes?
You can map over columns:
import org.apache.spark.sql.functions.col
df.select(colNos map df.columns map col: _*)
or:
df.select(colNos map (df.columns andThen col): _*)
or:
df.select(colNos map (col _ compose df.columns): _*)
All the methods shown above are equivalent and don't impose a performance penalty. The following mapping:
colNos map df.columns
is just a local Array access (constant-time access for each index), and choosing between the String-based and Column-based variants of select doesn't affect the execution plan:
val df = Seq((1, 2, 3, 4, 5, 6)).toDF
val colNos = Seq(0, 3, 5)
df.select(colNos map df.columns map col: _*).explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
df.select("_1", "_4", "_6").explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
@user6910411's answer above works like a charm and the number of tasks/logical plan is similar to my approach below. BUT my approach is a bit faster.
So,
I would suggest you go with column names rather than column numbers. Column names are much safer and much lighter than using numbers. You can use the following solution:
val colNames = Seq("col1", "col2" ...... "col99", "col100")
val selectColNames = Seq("col1", "col3", .... selected column names ... )
val selectCols = selectColNames.map(name => df.col(name))
df = df.select(selectCols:_*)
If you are hesitant to write out all 100 column names, there is a shortcut too:
val colNames = df.schema.fieldNames
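If all you have is the index positions, you can still use the name-based select by looking the names up in df.columns first (a sketch; colNos and df are assumed to come from the question):
import org.apache.spark.sql.functions.col

val colNos = Array(10, 20, 25, 45)

// df.columns is a plain local Array[String], so indexing into it is cheap
val selectColNames = colNos.map(i => df.columns(i))
val selectCols     = selectColNames.map(name => col(name))

val subset = df.select(selectCols: _*)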
Example: Grab first 14 columns of Spark Dataframe by Index using Scala.
import org.apache.spark.sql.functions.col
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols.map(name=>col(name)):_*)
You cannot simply do this (as I tried and failed):
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols)
The reason is that you have to convert the Array[String] into an Array[org.apache.spark.sql.Column] for the select to work.
OR Wrap it in a function using Currying (high five to my colleague for this):
import org.apache.spark.sql.DataFrame

// Subsets the DataFrame to the columns between the beg_val and end_val indexes.
def subset_frame(beg_val: Int = 0, end_val: Int)(df: DataFrame): DataFrame = {
  val sliceCols = df.columns.slice(beg_val, end_val)
  df.select(sliceCols.map(name => col(name)): _*)
}
// Get first 25 columns as subsetted dataframe
val subset_df:DataFrame = df_.transform(subset_frame(0, 25))

How to impute NULL values to zero in Spark/Scala

I have a DataFrame in which some columns are of type String and contain "NULL" as a string value (not an actual NULL). I want to impute them with zero. Apparently df.na.fill(0) doesn't work. How can I impute them with zero?
You can use replace() from DataFrameNaFunctions, which is accessed through the .na prefix:
val df1 = df.na.replace("*", Map("NULL" -> "0"))
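For example, a minimal sketch of the effect on a toy DataFrame that contains the literal string "NULL":
import spark.implicits._   // already in scope in Zeppelin / spark-shell

val toy = Seq(("a", "NULL"), ("b", "3")).toDF("k", "v")

// "*" applies the replacement map to every column of a matching type
toy.na.replace("*", Map("NULL" -> "0")).show
// +---+---+
// |  k|  v|
// +---+---+
// |  a|  0|
// |  b|  3|
// +---+---+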
You could also create your own udf that replicates this behaviour:
import org.apache.spark.sql.functions.{col, udf}

val nullReplacer = udf((x: String) => {
  if (x == "NULL") "0"
  else x
})

val df1 = df.select(df.columns.map(c => nullReplacer(col(c)).alias(c)): _*)
However this would be superfluous given it does the same as the above, at the cost of more lines of code than necessary.

Filter out rows with NaN values for certain column

I have a dataset and in some of the rows an attribute value is NaN. This data is loaded into a DataFrame, and I would like to use only the rows in which all attributes have values. I tried doing it via SQL:
val df_data = sqlContext.sql("SELECT * FROM raw_data WHERE attribute1 != NaN")
I tried several variants of this, but I can't seem to get it working.
Another option would be to transform it into an RDD and then filter it, since filtering this DataFrame by checking whether an attribute isNaN does not work.
I know you accepted the other answer, but you can do it without the explode (which should perform better, since it avoids doubling your DataFrame's size).
Prior to Spark 1.6, you could use a udf like this:
def isNaNudf = udf[Boolean, Double](d => d.isNaN)
df.filter(!isNaNudf($"value"))
As of Spark 1.6, you can now use the built-in SQL function isnan() like this:
df.filter(!isnan($"value"))
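A quick sketch of the isnan variant, on toy data shaped like the sample in the other answer below:
import org.apache.spark.sql.functions.isnan
import spark.implicits._   // already in scope in Zeppelin / spark-shell

val toy = Seq((1, 0.5), (2, Double.NaN)).toDF("id", "value")

// keep only the rows whose value is not NaN
toy.filter(!isnan($"value")).show
// +---+-----+
// | id|value|
// +---+-----+
// |  1|  0.5|
// +---+-----+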
Here is some sample code that shows my way of doing it:
import sqlContext.implicits._
val df = sc.parallelize(Seq((1, 0.5), (2, Double.NaN))).toDF("id", "value")
val df2 = df.explode[Double, Boolean]("value", "isNaN")(d => Seq(d.isNaN))
df will have -
df.show
id value
1 0.5
2 NaN
while filtering df2 will give you what you want:
df2.filter($"isNaN" !== true).show
id value isNaN
1 0.5 false
This works:
where isNaN(tau_doc) = false
e.g.
val df_data = sqlContext.sql("SELECT * FROM raw_data where isNaN(attribute1) = false")